1960 INTERNATIONAL CONGRESS FOR LOGIC, METHODOLOGY,
AND PHILOSOPHY OF SCIENCE


An International Congress for Logic, Methodology, and Philosophy of 
Science will be held at Stanford University from August 24 to September 2, 
1960, under the auspices of the International Union for History and Philosophy 
of Science. The proceedings of the Congress will be organized into the
following sections.

1. Mathematical logic
2. Foundations of mathematical theories
3. Philosophy of logic and mathematics
4. General problems of methodology and philosophy of science
5. Foundations of probability and induction
6. Methodology and philosophy of physical sciences
7. Methodology and philosophy of biological and psychological sciences
8. Methodology and philosophy of social sciences
9. Methodology and philosophy of linguistics
10. Methodology and philosophy of historical sciences
11. History of logic, methodology, and philosophy of science

The proceedings will consist of a number of invited addresses, in addition 
to brief contributed papers. The closing date for submission of abstracts 
of contributed papers is March 1, 1960. Information about membership 
fees and other details of the Congress may be obtained by writing Professor 
Patrick Suppes, Serra House, Stanford, California. 

The Institute of Mathematical Statistics and the University of Chicago
have established a series of publications entitled Statistical Research
Monographs.

The primary purpose of this series is to provide a medium of publication 
for material of interest to statisticians that is not ordinarily provided for by 
existing media. Among the kinds of publications envisaged are: 

new research results too lengthy for the usual journal article; 

research results of interest in both theoretical and applied statistics; 

expository monographs in particular areas of statistics; 

discussions of statistical problems and techniques in particular areas of application.

Authors are invited to send manuscripts and correspondence concerning 
the series to Leo A. Goodman, Department of Statistics, University of 
Chicago, Chicago 37, Illinois. 

STATISTICAL INFERENCES ABOUT TRUE SCORES*


FREDERIC M. LORD


EDUCATIONAL TESTING SERVICE 


Formulas are derived for unbiased sample estimators of any raw or
central moment of the frequency distribution of true scores. A general method
is developed for obtaining from each examinee’s observed score a least squares
estimate of his true score.


The ordinary “observed” test score frequently can be used in practice
without thought of separating it into what must logically be its component 
parts: true score and error of measurement. From a scientific point of view, 
however, the entire concern must be with true score; the observed score 
can be of scientific interest only as it leads to inferences and generalizations 
regarding true scores. Some new methods for making such generalizations 
are reported in the present article under two separate headings: “Inferring
the Shape of the Frequency Distribution of True Scores” and “Estimating
the True Score of Each Examinee.” The methods given here for inferring
the true-score moments are based on different mathematical models than 
that used in a previous derivation [10]; the formulas presented here are 
considerably easier to apply in actual practice. 

The matrix-sampling model, used in all but the first two sections of 
this article, readily yields a wide variety of basic results of importance to 
test theory. The shape of the bivariate frequency distribution of observed 
scores on two parallel forms of the same test, for example, can be predicted 
from the data provided by either form alone. Again, the relationship between 
true scores on two different tests can be estimated from observed data; in 
particular, the question can be considered whether the true scores have a 
perfect curvilinear relationship. Although it is easy to obtain estimators for 
these and other purposes, it is more difficult to obtain usable formulas for 
their sampling errors, at least in the case of those estimators that involve 
terms above the second degree. (Some standard error formulas for second- 
degree estimators are given in [9].) Since a knowledge of the sampling errors 
of the estimators is important, it is hoped to obtain formulas for more of 
these before further estimation equations are published. 

In the present article, formulas are derived (see Rule 1) for unbiased 
sample estimators of any raw or central moment of the frequency distribution 
of true scores. Explicit and relatively convenient computational formulas 
are given for both raw and central moments up to and including the fourth 


*This research was carried out under contract Nonr-2214(00) with the Office of Naval 
Research, Department of the Navy. 
order. Methods for fitting frequency curves to a set of estimated true-score
moments are briefly discussed.

A general method is developed for obtaining from each examinee’s
observed score a least squares estimate of his true score. Detailed
computational formulas for making such estimates are given for the case where
the regression of true score on observed score is assumed to be linear and for
the case where this regression is assumed to be a third-degree parabola. The
method can be extended to the case where this regression may be
approximated by a parabola of any specified degree.


Inferring the Shape of the Frequency Distribution of True Scores 


The true score of an individual is customarily estimated by the linear
regression equation ([7], eq. 11:20)

(1)  $\hat{\zeta}_a = \bar{z} + r_{zz}(z_a - \bar{z})$,

where $\hat{\zeta}_a$ is the estimated true score of examinee a, $r_{zz}$ is the test reliability,
$z_a$ is the observed score of examinee a, and $\bar{z}$ is the mean score of the group
of examinees to which examinee a belongs. Since the relationship between
$\hat{\zeta}_a$ and $z_a$ in (1) is linear, the frequency distribution of $\hat{\zeta}_a$ is the same as that
of $z_a$ except for origin and unit of measurement. Equation (1), therefore,
does not provide a useful method for estimating the shape of the distribution
of true scores.


The Assumption of Normally Distributed Errors 


A first approximation to estimating the true-score distribution can 
be obtained by assuming the errors of measurement, e, to be distributed
normally, independently of true score, with a mean of zero and a standard
deviation, $\sigma_e$, which may be estimated experimentally. In this oversimplified
situation, a very neat relationship (first called to the writer’s attention by 
Professor Max Woodbury) exists between the true-score distribution and 
the observed-score distribution. 

Let $\phi_\zeta(t)$ be the characteristic function ([8], vol. 1, ch. 4) of the true-score
distribution. Since e is normally distributed, its c.f. is $\exp(-\tfrac{1}{2}t^2\sigma_e^2)$. By
definition, $z_a = \zeta_a + e_a$. A well-known theorem states that the c.f. of a sum
of two independently distributed variables is equal to the product of their
c.f.’s. The c.f. of z is thus

(2)  $\phi_z(t) = \phi_\zeta(t)\exp(-\tfrac{1}{2}t^2\sigma_e^2)$,

and the cumulant generating function for z is

(3)  $\psi_z(t) = \log\phi_z(t) = \psi_\zeta(t) - \tfrac{1}{2}t^2\sigma_e^2$.

Since the rth cumulant of the distribution of z is the coefficient of $(it)^r/r!$
in the Taylor expansion of $\psi_z(t)$, it is clear that the true-score distribution and
the observed-score distribution have identical cumulants except for the
second; the variance (second cumulant) of the true-score distribution is
equal to the variance of the observed-score distribution minus the variance
of the errors of measurement, as is, of course, well known from other lines of
reasoning.

Given the moments of the observed-score distribution and an experimentally
determined estimate of the standard error of measurement, estimates
of all the moments of the true-score distribution can thus be readily written
down from standard formulas relating moments to cumulants ([8], eqs.
3.30 and 3.34). (The shape of a distribution is ordinarily determined by its
moments. Methods for doing this will be discussed briefly at a later point.)
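
The cumulant argument above can be illustrated with a minimal Python sketch. It assumes normally distributed errors that are independent of true score, as in this section; the function names and the restriction to the first four moments are illustrative choices, not part of the original derivation. The moment-to-cumulant conversions are the standard ones.

```python
def raw_to_cumulants(m1, m2, m3, m4):
    """Convert the first four raw moments to the first four cumulants."""
    k1 = m1
    k2 = m2 - m1**2
    k3 = m3 - 3*m1*m2 + 2*m1**3
    k4 = m4 - 4*m1*m3 - 3*m2**2 + 12*m1**2*m2 - 6*m1**4
    return k1, k2, k3, k4


def true_score_cumulants(observed_raw_moments, error_variance):
    """Cumulants of the true-score distribution under normal, independent errors.

    Only the second cumulant changes: the error variance is subtracted from
    the observed-score variance. All other cumulants carry over unchanged.
    """
    k1, k2, k3, k4 = raw_to_cumulants(*observed_raw_moments)
    return k1, k2 - error_variance, k3, k4
```

For example, observed raw moments (1, 3, 7, 25) describe a normal distribution with mean 1 and variance 2; with an error variance of 0.5 the sketch returns a true-score variance of 1.5 and leaves the other cumulants untouched.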

The foregoing convenient method for determining the shape of the 
true-score distribution may be a good approximation in the usual situation 
where the observed-score distribution is nearly normal. However, in this
case the problem has little special interest, since the true-score distribution
will then also be approximately normal with known mean and variance. 

Actually, it is clear that when an examinee’s true score is close to zero 
or to n, the number of items in the test, the error of measurement in his 
observed score is likely to have a distribution that is skewed and a variance 
smaller than that for an examinee with a less extreme true score. Some more 
satisfactory assumption about the nature and frequency distribution of the 
errors of measurement is needed. 


The Matrix-Sampling Model 


The matrix-sampling model ([9], sec. 8) completely abandons the assumption
that the errors of measurement are distributed normally and independently
of true score. Instead, the group of examinees tested is considered as
drawn at random from a very large population of examinees (“type-1” sampling)
and the test is considered as a sample of items drawn at random from a
very large pool of items (“type-2” sampling). The error of measurement in each
test score is, by definition, the sampling fluctuation arising from the type-2
sampling process; its frequency distribution and other properties are then
inferred by means of standard sampling theory. Methods for estimating true
scores are then readily devised.

It is assumed that the score on the test is either the number of items
answered correctly, i.e., $x_a = \sum_{i=1}^{n} x_{ia}$, where $x_{ia} = 1$ if examinee a answers
item i correctly, and $x_{ia} = 0$ otherwise; or, more conveniently, the proportion
of items answered correctly, to be denoted by

(4)  $z_a = \frac{x_a}{n} = \frac{1}{n}\sum_{i=1}^{n} x_{ia}$.

Unless otherwise specified, the “score” of a test will hereafter refer to the
proportion-correct rather than to the number-correct.

For convenience, this notation suppresses a third subscript, u, which
must always be understood: $x_{iau}$ is the score of examinee a on the ith item
in test u. Note that the subscript i identifies a unique item only when a
specified test (sample of items) is under consideration; otherwise, the
subscript i merely denotes the ith position in any test u. In general there will be
only one test for which real data are actually available; all other tests are
hypothetical samples of items drawn at random from a hypothesized item
pool.

The symbol $E_i$ will denote the operation of taking the expected value
of a statistic in type-2 sampling, i.e., the operation of taking the average
value of the statistic over all possible samples of n items each. For example,
$E_i x_{ia}$ is the equivalent of

  $\lim_{U\to\infty}\frac{1}{U}\sum_{u=1}^{U} x_{iau}$.

Thus the symbol E can be manipulated in the same way as an ordinary
summation sign. (Note in particular that the expected value of a sum of
variables is equal to the sum of the expected values of the variables, and
that the expected value of a product of uncorrelated variables is equal to the
product of the expected values of the variables.)

The expected value over all examinees in the population will be denoted
by $E_a$. The expected value taken simultaneously over all possible
combinations of n-item tests and over all examinees in the population will be
denoted by $E_{ia}$. The average value over all examinees in the group (sample
of examinees) tested will be denoted by $A_a$ (or $A_b$) when the number, N, of
examinees is very large and by the equivalent operator $N^{-1}\sum_{a=1}^{N}$ otherwise.
When N is very large, the effect of the operators $E_a$ and $A_a$ is identical, but
it seems wise to use different symbols for the two different operations.

The first problem will be to obtain estimates of the moments of the 
true-score distribution that are unbiased in type-2 sampling. For convenience, 
these will be called type-2-unbiased estimates of the moments of the true 
scores. (By definition, an unbiased estimate is one whose expected value is 
equal to the parameter being estimated.) The formulas for all of these
unbiased estimates, together with the moments of their sampling distributions,
can be conveniently obtained by methods developed by Tukey [15, 16], 
Hooke [5, 6], and Robson [13]. The present derivations, however, will be 
carried through by means of procedures and terminology more familiar to 
the average reader. 

Some further notation may be defined at this point. 


True proportion-score of examinee a (his mean proportion-score over all
possible tests):

(5)  $\zeta_a = E_i z_a = E_i x_{ia}$.

Difficulty of item i (proportion of examinees answering item i correctly),
for the population of examinees and for the sample, respectively:

(6)  $\pi_i = E_a x_{ia}$;  $p_i = A_a x_{ia} = \frac{1}{N}\sum_{a=1}^{N} x_{ia}$.

Average true proportion-score = average item difficulty:

(7)  $\bar{\zeta} = \bar{\pi} = E_a\zeta_a = E_i\pi_i = E_aE_i x_{ia} = E_iE_a x_{ia} = E_{ia}x_{ia}$.

Proportion of examinees answering correctly all of the r items $i_1, i_2, \ldots, i_r$:

(8)  $\pi_{i_1i_2\cdots i_r} = E_a x_{i_1a}x_{i_2a}\cdots x_{i_ra}$,
     $p_{i_1i_2\cdots i_r} = A_a x_{i_1a}x_{i_2a}\cdots x_{i_ra}$.

g, h, i, j will be used as alternative subscripts to designate items; a and b
will be used as subscripts to designate examinees. The symbol $n^{[r]}$ will denote
the product $n(n-1)(n-2)\cdots(n-r+1)$.

Type-2-Unbiased Estimates of Raw True-Score Moments. Now consider,
for example, the quantity

  $\frac{1}{n^{[3]}}\sum\sum\sum_{g\ne i\ne j}^{n(n-1)(n-2)}\pi_g\pi_{ij}$.

(The product above the summation signs indicates the number of terms in
the summation.) The expected value of this quantity in type-2 sampling is

(9)  $E_i\,\frac{1}{n^{[3]}}\sum\sum\sum_{g\ne i\ne j}\pi_g\pi_{ij} = \frac{1}{n^{[3]}}\sum\sum\sum_{g\ne i\ne j}E_i(E_a x_{ga})(E_b x_{ib}x_{jb})$
     $= E_aE_bE_i\,x_{ga}x_{ib}x_{jb}$
     $= E_aE_b\,\zeta_a\zeta_b^2$
     $= (E_a\zeta_a)(E_b\zeta_b^2)$
     $= \mu_1'\mu_2'$.

The steps in the foregoing reduction are as follows. (i) The $E_i$ is moved to
the right and the definitions of $\pi_g$ and $\pi_{ij}$ are substituted for these two symbols.
(ii) The value following the summation signs is the same for any g, i, and j;
since there are $n(n-1)(n-2)$ terms under the summation signs, everything
preceding $E_i$ cancels. The $E_a$ and $E_b$ symbols are moved to the left (the population
of examinees is assumed to be so large that terms with a = b may be ignored).
(iii) If a person’s score on the ith item on a particular test (random sample
of items) is higher than on other tests, this is no reason why his score (or
another examinee’s score) on the jth item should be higher than on other
tests; hence $x_{ib}$ and $x_{jb}$ (and also $x_{ga}$) fluctuate independently in type-2
sampling, and the expectation of their product is the product of their
expectations. (iv) $E_b$ is moved to the right. (v) $\mu_r'$ is here defined as the rth
raw moment (moment about the origin) of the true-score distribution.

The quantity following $E_i$ on the left side of (9) is seen to be an unbiased
estimate of $\mu_1'\mu_2'$. An unbiased estimate of any product of true-score moments
about the origin may be obtained similarly by Rule 1.

RULE 1. To get a type-2-unbiased estimate of the product $\mu'_{r_1}\mu'_{r_2}\cdots\mu'_{r_W}$ of
raw true-score moments, (i) replace each $\mu'_{r_w}$ ($w = 1, 2, \ldots, W$) by the symbol
$\pi$ with $r_w$ different subscripts, (ii) take the r-tuple sum of the resulting product
of $\pi$’s over all $r = \sum_{w}^{W} r_w$ subscripts, no two of which may assume the same
value, (iii) divide by $n^{[r]}$.

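For example, an unbiased estimate for $\mu_1' = \bar{\zeta}$ in type-2 sampling is

(10)  $\mathrm{Est}_2\,\mu_1' = \frac{1}{n}\sum_{i}^{n}\pi_i$,

where $\mathrm{Est}_2$ is read “a type-2-unbiased estimate of.” Similarly,

(11)  $\mathrm{Est}_2\,\mu_2' = \frac{1}{n^{[2]}}\sum\sum_{i\ne j}\pi_{ij}$,

(12)  $\mathrm{Est}_2\,\mu_1'^2 = \frac{1}{n^{[2]}}\sum\sum_{i\ne j}\pi_i\pi_j$,

(13)  $\mathrm{Est}_2\,\mu'_{r_1}\mu'_{r_2}\cdots\mu'_{r_W} = \frac{1}{n^{[r]}}\sum\cdots\sum_{i_1\ne i_2\ne\cdots\ne i_r}\pi_{i_1\cdots i_{r_1}}\,\pi_{i_{r_1+1}\cdots i_{r_1+r_2}}\cdots\pi_{\cdots i_r}$.
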

If the number, N, of examinees in the group actually tested is large, the
symbol $\pi$ in Rule 1 and in equations (9) through (13) may be replaced by p
to obtain type-2-unbiased estimates from actual sample data.
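
Rule 1 can be sketched directly as a brute-force sum. The Python below is only an illustration under stated assumptions: N is treated as large, so sample proportions p replace the $\pi$’s; the data are a 0/1 matrix with one row per examinee; and the names `joint_p`, `falling`, and `rule1_estimate` are hypothetical helpers, not notation from the article. The cost grows as $n^{[r]}$, so this is suitable only for checking small cases.

```python
from fractions import Fraction
from itertools import permutations
from math import prod


def joint_p(X, items):
    """Proportion of examinees (rows of X) answering all the given items correctly."""
    return Fraction(sum(all(row[i] for i in items) for row in X), len(X))


def falling(n, r):
    """Falling factorial n(n-1)...(n-r+1)."""
    return prod(n - k for k in range(r))


def rule1_estimate(X, orders):
    """Estimate a product of raw true-score moments with orders r_1, ..., r_W.

    Sums the product of joint proportions over all r-tuples of distinct
    item subscripts and divides by the falling factorial n^[r].
    """
    n = len(X[0])
    r = sum(orders)
    total = Fraction(0)
    for subs in permutations(range(n), r):  # every subscript distinct
        term, start = Fraction(1), 0
        for r_w in orders:
            term *= joint_p(X, subs[start:start + r_w])
            start += r_w
        total += term
    return total / falling(n, r)
```

Calling `rule1_estimate(X, (2,))` corresponds to (11) and `rule1_estimate(X, (1, 1))` to (12), with p in place of $\pi$.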

Matrix-Unbiased Estimates. Rule 1 gives type-2-unbiased estimates of
products of true-score moments, expressed in terms of products of $\pi$’s. If N
is not large, it is desirable to find type-1-unbiased estimates of these products
of $\pi$’s. Such estimates are matrix-unbiased estimates of the products of
true-score moments, i.e., simultaneously type-1- and type-2-unbiased. In the
discussion of such estimates, the assumption that N is large will be dropped.
First of all, the two quantities

(14)  $p_i = \frac{1}{N}\sum_{a=1}^{N}x_{ia}$,  $p_{ij} = \frac{1}{N}\sum_{a=1}^{N}x_{ia}x_{ja}$

are sample means in type-1 samples and hence are type-1-unbiased estimates
of $\pi_i$ and $\pi_{ij}$, respectively:

(15)  $\mathrm{Est}_1\,\pi_i = p_i$,
      $\mathrm{Est}_1\,\pi_{ij} = p_{ij}$.
To take a less simple case, consider the covariance between items i and j,

(16)  $s_{ij} = p_{ij} - p_ip_j$.

According to the usual formula for the expected value of a sample covariance,

(17)  $E_a s_{ij} = (N-1)\sigma_{ij}/N$,

where

(18)  $\sigma_{ij} = \pi_{ij} - \pi_i\pi_j$.

Thus it is seen that

(19)  $E_a\,\frac{1}{N-1}(Np_ip_j - p_{ij}) = E_a\left(-\frac{N}{N-1}s_{ij} + p_{ij}\right) = -\sigma_{ij} + \pi_{ij} = \pi_i\pi_j$.



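Now, since the sample mean of the observed test score is

(20)  $\bar{z} = \frac{1}{n}\sum_{i}^{n}p_i$,

it follows from (10), (15), (14), and (20), that

(21)  $\mathrm{Est}_{12}\,\mu_1' = \mathrm{Est}_1\,\frac{1}{n}\sum_{i}^{n}\pi_i = \frac{1}{n}\sum_{i}^{n}p_i = \bar{z}$.

Similarly, from (11) and (15),

(22)  $\mathrm{Est}_{12}\,\mu_2' = \mathrm{Est}_1\,\frac{1}{n^{[2]}}\sum\sum_{i\ne j}\pi_{ij} = \frac{1}{n^{[2]}}\sum\sum_{i\ne j}p_{ij}$.

Likewise, from (12) and (19)

(23)  $\mathrm{Est}_{12}\,\mu_1'^2 = \mathrm{Est}_1\,\frac{1}{n^{[2]}}\sum\sum_{i\ne j}\pi_i\pi_j = \frac{1}{n^{[2]}}\,\frac{1}{N-1}\sum\sum_{i\ne j}(Np_ip_j - p_{ij})$.

Equations (21), (22), and (23) provide estimators of $\mu_1'$, $\mu_2'$, and $\mu_1'^2$ that are
unbiased in matrix sampling.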

In general, estimates that are type-2-unbiased will differ from estimates 
that are matrix-unbiased at most by quantities of order 1/N. Since such 
quantities will be negligible if N is at all large, further consideration will 
center on type-2-unbiased estimates. 

Computational Formulas for Type-2-Unbiased Estimates. Although the
type-2-unbiased estimators so far discussed are exact, these estimators are not
yet expressed in a convenient computational form. More convenient
formulations will now be obtained, retaining quantities of order 1/n, since these will
be needed for some purposes, but in certain cases neglecting quantities
$O(1/n^2)$. The reason for retaining quantities $O(1/n)$ is apparent from the fact
that the variance of the observed scores differs from the variance of the true
scores only by quantities of this order.

Very large N is again assumed. Further notation will be needed. 


rth raw moment of observed proportion-scores:

(24)  $m_r' = \frac{1}{N}\sum_{a=1}^{N}z_a^r$.

Variance of item difficulties:

(25)  $s_p^2 = \frac{1}{n}\sum_{i=1}^{n}p_i^2 - \bar{z}^2$.

rth raw moment of item difficulties:

(26)  $M_r' = \frac{1}{n}\sum_{i=1}^{n}p_i^r$.

The foregoing quantities are all statistics for the sample of examinees tested
and are thus computable from the data.
First of all, $x_a$ has a binomial distribution in type-2 sampling. The
factorial moments ([8], pp. 56-60) about the origin of a binomial distribution
have a particularly simple form ([8], p. 118): the rth factorial sampling
moment of $x_a = nz_a$ is

(27)  $\mu'_{[r]}(x_a) = E_i x_a^{[r]} = n^{[r]}\zeta_a^r$.

It is thus clear that

(28)  $\mathrm{Est}_2\,\mu_r' = \frac{1}{n^{[r]}}E_a x_a^{[r]} = \frac{1}{n^{[r]}}A_a x_a^{[r]}$.

The quantity on the right is a factorial moment readily computable from
the frequency distribution of observed scores. Computing formulas for
$r = 1, \ldots, 6$ are given in (35) for convenient reference. Equation (28) may
be stated as a theorem.


THEOREM 1. The rth raw factorial moment of the distribution of number-
correct observed scores, $x_a$, divided by $n^{[r]}$ is a type-2-unbiased estimate of the
rth raw ordinary moment of the distribution of proportion-correct true scores, $\zeta_a$.
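
Theorem 1 translates into a very short computation. The following Python sketch assumes a list of number-correct scores for a large group and an n-item test; the function names are illustrative only. Exact rational arithmetic is used so the result can be compared with hand calculation.

```python
from fractions import Fraction
from math import prod


def falling(x, r):
    """Falling factorial x(x-1)...(x-r+1)."""
    return prod(x - k for k in range(r))


def est_raw_true_moment(number_correct_scores, n, r):
    """Type-2-unbiased estimate of the rth raw true-score moment, as in (28).

    Averages the rth factorial power of each number-correct score and
    divides by the falling factorial n^[r].
    """
    N = len(number_correct_scores)
    total = sum(falling(x, r) for x in number_correct_scores)
    return Fraction(total, N * falling(n, r))
```

With scores 2, 1, 2, 3 on a three-item test, the estimates of the first three raw moments are 2/3, 5/12, and 1/4.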


Computational formulas for unbiased estimates of various products
of true-score moments will now be obtained. First of all, by Rule 1,

(29)  $\mathrm{Est}_2\,\mu_1'^r = \frac{1}{n^{[r]}}\sum\cdots\sum_{i_1\ne i_2\ne\cdots\ne i_r}\pi_{i_1}\pi_{i_2}\cdots\pi_{i_r} = \frac{r!}{n^{[r]}}(1^r)$,

where $(1^r)$ denotes an rth order unitary symmetric function ([1], p. 431).
This symmetric function is readily rewritten in terms of power-sums with
the help of the tables of coefficients provided in [2]. The rth power-sum of $\pi_i$
is n times the rth raw moment of the distribution of the $\pi_i$, for which may
be substituted the computed value of $M_r'$, the rth raw moment of the
distribution of the $p_i$. The results are given for r = 2, 3, 4 in (36). The necessary
M’s are readily computed for any given set of item data from the frequency
distribution of the n observed item difficulties, $p_i$.
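
The passage from the restricted sum in (29) to power-sums can be checked numerically. The sketch below, for r = 3 only, compares the brute-force sum over distinct subscripts with the power-sum expression $n^3M_1'^3 - 3n^2M_1'M_2' + 2nM_3'$; the item difficulties and function names are made-up illustrations, and p is used in place of $\pi$ on the assumption that N is large.

```python
from fractions import Fraction
from itertools import permutations


def est_mu1_cubed_bruteforce(p):
    """Sum p_g * p_h * p_i over all ordered triples of distinct items, divided by n^[3]."""
    n = len(p)
    total = sum(a * b * c for a, b, c in permutations(p, 3))
    return total / (n * (n - 1) * (n - 2))


def est_mu1_cubed_powersum(p):
    """Same estimate written in terms of raw moments of the item difficulties."""
    n = len(p)

    def M(r):
        return sum(v**r for v in p) / n

    numerator = n**3 * M(1)**3 - 3 * n**2 * M(1) * M(2) + 2 * n * M(3)
    return numerator / (n * (n - 1) * (n - 2))
```

Because `permutations` works on positions, tied difficulties are still treated as distinct items, which matches the restriction that no two subscripts may coincide.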
A similar but more complicated procedure may be applied to estimate
other products of true-score moments. For example, starting from Rule 1,

(30)  $\mathrm{Est}_2\,\mu_1'\mu_2' = \frac{1}{n^{[3]}}\sum\sum\sum_{g\ne h\ne i}^{n(n-1)(n-2)}\pi_g\pi_{hi}$
      $= \frac{1}{n^{[3]}}\Big(\sum_g\sum_h\sum_i\pi_g\pi_{hi} - 2\sum_g\sum_h\pi_g\pi_{gh} - \sum_g\sum_h\pi_g\pi_{hh} + 2\sum_g\pi_{gg}\pi_g\Big)$.

The tables in [2] do not provide the last expression directly, but they do greatly
facilitate what would otherwise be an awkward algebraic task. The reader
may easily verify the algebraic identity, however, by splitting up the various
terms in the last expression, e.g.,

  $\sum_g\sum_h\sum_i\pi_g\pi_{hi} = \sum\sum\sum_{g\ne h\ne i}\pi_g\pi_{hi} + 2\sum\sum_{g\ne h}\pi_g\pi_{gh} + \sum\sum_{g\ne h}\pi_g\pi_{hh} + \sum_g\pi_g\pi_{gg}$,

and then recombining the results. Since $\pi_{gg} = \pi_g$, (30) can be rewritten as
shown in the first equation in (31). The process illustrated in (30) becomes
an easy one to carry out without the use of tables if all terms $O(1/n^2)$ are
dropped. In (30), the term $2\sum_g^n\pi_{gg}\pi_g/n^{[3]}$ is $O(1/n^2)$.

Formulas derived similarly are given in (31) for the expected values of
every product of true-score moments up to and including those of the fourth
order, excepting those covered by (36) and by Theorem 1. The last term of
the first equation in (31) after multiplication by $1/n^{[3]}$ is only of order $1/n^2$.
All such terms are represented in the remaining equations of (31) by the
symbols $O(1/n^2)$.

(31)
  $\mathrm{Est}_2\,\mu_1'\mu_2' = \frac{1}{n^{[3]}}\Big(nM_1'\sum_g\sum_h\pi_{gh} - 2\sum_g\sum_h\pi_g\pi_{gh} - n^2M_1'^2 + 2nM_2'\Big)$,

  $\mathrm{Est}_2\,\mu_1'^2\mu_2' = \frac{1}{n^{[4]}}\Big[(n^2M_1'^2 - nM_2')\sum_g\sum_h\pi_{gh} - n^3M_1'^3 - 4nM_1'\sum_g\sum_h\pi_g\pi_{gh}\Big] + O\Big(\frac{1}{n^2}\Big)$,

  $\mathrm{Est}_2\,\mu_2'^2 = \frac{1}{n^{[4]}}\Big[\Big(\sum_g\sum_h\pi_{gh}\Big)^2 - 2nM_1'\sum_g\sum_h\pi_{gh} - 4\sum_g\sum_h\sum_i\pi_{gh}\pi_{gi}\Big] + O\Big(\frac{1}{n^2}\Big)$,

  $\mathrm{Est}_2\,\mu_1'\mu_3' = \frac{1}{n^{[4]}}\Big(nM_1'\sum_g\sum_h\sum_i\pi_{ghi} - 3nM_1'\sum_g\sum_h\pi_{gh} - 3\sum_g\sum_h\sum_i\pi_g\pi_{ghi}\Big) + O\Big(\frac{1}{n^2}\Big)$.


The computation of the quantities in (31) that involve summation
signs can be further simplified, after substituting p’s for $\pi$’s.

(32)  $\sum_g\sum_h p_{gh} = A_a\sum_g\sum_h x_{ga}x_{ha} = A_a\Big(\sum_g x_{ga}\Big)^2 = n^2m_2'$.

(33)  $\sum_g\sum_h p_hp_{gh} = \sum_h p_h\sum_g p_{gh} = \sum_h p_h\sum_g A_a x_{ga}x_{ha}$
      $= n\sum_h p_hA_a x_{ha}z_a = n\sum_h p_hz_h$,

where $z_h = A_a x_{ha}z_a$ is 1/N times the sum of the scores of those examinees
who answered item h correctly. Similarly

(34)
  $\sum_g\sum_h\sum_i p_{gh}p_{gi} = n^2\sum_g z_g^2$,
  $\sum_g\sum_h\sum_i p_{ghi} = n^3m_3'$,
  $\sum_g\sum_h\sum_i p_gp_{ghi} = n^2\sum_i p_iz_i^{(2)}$,

where $z_i^{(2)} = A_a x_{ia}z_a^2$ is 1/N times the sum of the squared scores of those
examinees who answered item i correctly.
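
The reductions in (32), (33), and the first line of (34) can be verified on any 0/1 item matrix. The Python sketch below does so with exact arithmetic; the matrix layout (one row per examinee), the helper `A`, and the function name are assumptions made for illustration, and sample averages stand in for the operator $A_a$.

```python
from fractions import Fraction


def check_identities(X):
    """Return truth values for identities (32), (33), and the first line of (34)."""
    N, n = len(X), len(X[0])
    z = [Fraction(sum(row), n) for row in X]          # proportion-correct scores

    def A(values):
        """Average over examinees."""
        return Fraction(sum(values)) / N

    p = [A(row[i] for row in X) for i in range(n)]
    p2 = [[A(row[g] * row[h] for row in X) for h in range(n)] for g in range(n)]
    z_item = [A(row[h] * z_a for row, z_a in zip(X, z)) for h in range(n)]
    m2 = A(v * v for v in z)

    lhs32 = sum(p2[g][h] for g in range(n) for h in range(n))
    lhs33 = sum(p[g] * p2[g][h] for g in range(n) for h in range(n))
    lhs34 = sum(sum(p2[g]) ** 2 for g in range(n))
    return (lhs32 == n * n * m2,
            lhs33 == n * sum(p[h] * z_item[h] for h in range(n)),
            lhs34 == n * n * sum(v * v for v in z_item))
```

The sums here run over all subscripts, including equal ones, with $p_{gg} = p_g$, which is the convention used in (31).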

Equations (35), (36), and (37) summarize for convenient reference all
computational formulas for type-2-unbiased estimates of products of
true-score raw moments up through the fourth order. Equations (35), given
through the sixth order, are obtained from (28) and from the standard
formulas ([8], eq. 3.18) expressing factorial moments in terms of the ordinary
raw moments. Equations (36) are obtained as described in connection with
(29). Equations (37) are obtained by substituting (32), (33), (34) in (31)
and dropping terms of order $1/n^2$.


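(35)
  $\mathrm{Est}_2\,\mu_1' = m_1'$,
  $\mathrm{Est}_2\,\mu_2' = \frac{1}{n^{[2]}}(n^2m_2' - nm_1')$,
  $\mathrm{Est}_2\,\mu_3' = \frac{1}{n^{[3]}}(n^3m_3' - 3n^2m_2' + 2nm_1')$,
  $\mathrm{Est}_2\,\mu_4' = \frac{1}{n^{[4]}}(n^4m_4' - 6n^3m_3' + 11n^2m_2' - 6nm_1')$,
  $\mathrm{Est}_2\,\mu_5' = \frac{1}{n^{[5]}}(n^5m_5' - 10n^4m_4' + 35n^3m_3' - 50n^2m_2' + 24nm_1')$,
  $\mathrm{Est}_2\,\mu_6' = \frac{1}{n^{[6]}}(n^6m_6' - 15n^5m_5' + 85n^4m_4' - 225n^3m_3' + 274n^2m_2' - 120nm_1')$.

(36)
  $\mathrm{Est}_2\,\mu_1'^2 = \frac{1}{n^{[2]}}(n^2M_1'^2 - nM_2')$,
  $\mathrm{Est}_2\,\mu_1'^3 = \frac{1}{n^{[3]}}(n^3M_1'^3 - 3n^2M_1'M_2' + 2nM_3')$,
  $\mathrm{Est}_2\,\mu_1'^4 = \frac{1}{n^{[4]}}(n^4M_1'^4 - 6n^3M_1'^2M_2' + 3n^2M_2'^2 + 8n^2M_1'M_3' - 6nM_4')$.

(37)
  $\mathrm{Est}_2\,\mu_1'\mu_2' = \frac{1}{n^{[3]}}\Big(n^3m_1'm_2' - 2n\sum_i p_iz_i - n^2m_1'^2\Big)$,
  $\mathrm{Est}_2\,\mu_1'^2\mu_2' = \frac{1}{n^{[4]}}\Big[n^3(nm_1'^2 - M_2')m_2' - n^3m_1'^3 - 4n^2m_1'\sum_i p_iz_i\Big]$,
  $\mathrm{Est}_2\,\mu_2'^2 = \frac{1}{n^{[4]}}\Big(n^4m_2'^2 - 2n^3m_1'm_2' - 4n^2\sum_i z_i^2\Big)$,
  $\mathrm{Est}_2\,\mu_1'\mu_3' = \frac{1}{n^{[4]}}\Big(n^4m_1'm_3' - 3n^3m_1'm_2' - 3n^2\sum_i p_iz_i^{(2)}\Big)$.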

[Computational Note. If computations will involve (37) as well as (35)
and (36), only the first two terms of each polynomial on the right should be
used from (35) and (36). Equations (37) do not retain terms $O(1/n^2)$, and it
seems to be better here to discard all terms $O(1/n^2)$ rather than to retain
some when others are discarded.]

Estimating the True-Score Central Moments. For many purposes,
type-2-unbiased estimates of the true-score central (about the mean) moments,
rather than of the true-score raw (about the origin) moments, are desired.
These are obtained by expressing the central moments in terms of products
of the raw moments ([8], p. 50) and then replacing these products by their
unbiased estimators. Thus a type-2-unbiased estimate of the true-score
variance is

(38)  $\mathrm{Est}_2\,\sigma_\zeta^2 = \mathrm{Est}_2(\mu_2' - \mu_1'^2) = \frac{1}{n^{[2]}}[A_a x_a(x_a - 1) - n^2M_1'^2 + nM_2']$
      $= \frac{1}{n-1}[ns_z^2 - \bar{z}(1 - \bar{z}) + s_p^2]$,

where $s_z^2$ is the variance of the observed proportion-scores.

A matrix-unbiased estimate of the true-score variance can be obtained
from (22), (23), and (16) if desired:

(39)  $\mathrm{Est}_{12}\,\sigma_\zeta^2 = \frac{1}{n^{[2]}}\,\frac{N}{N-1}\sum\sum_{i\ne j}(p_{ij} - p_ip_j)$
      $= \frac{N}{N-1}\,\frac{1}{n^{[2]}}\Big(\sum_i\sum_j s_{ij} - \sum_i p_i + \sum_i p_i^2\Big)$
      $= \frac{N}{N-1}\,\mathrm{Est}_2\,\sigma_\zeta^2$.

It is interesting to note that if test reliability is defined by the equation

(40)  $r_{zz} = r_{z\zeta}^2 = \frac{\sigma_\zeta^2}{s_z^2}$,

then substitution from (38) shows that the test reliability is the same as the
Kuder-Richardson formula-20 reliability coefficient, $r_{20}$, as is conveniently
seen from ([14], eq. 27).
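
The equivalence with Kuder-Richardson formula 20 can be checked numerically. The sketch below assumes a 0/1 item matrix with one row per examinee, uses variances with divisor N (as the large-N argument implies), and writes formula 20 in its usual form, n/(n - 1) times one minus the sum of item variances over the number-correct score variance. The function names and the example matrix are illustrative assumptions.

```python
from fractions import Fraction


def summary(X):
    """Return n, mean score, score variance, item-difficulty variance, and difficulties."""
    N, n = len(X), len(X[0])
    z = [Fraction(sum(row), n) for row in X]
    p = [Fraction(sum(row[i] for row in X), N) for i in range(n)]
    z_bar = sum(z) / N
    s_z2 = sum((v - z_bar) ** 2 for v in z) / N
    s_p2 = sum((v - z_bar) ** 2 for v in p) / n
    return n, z_bar, s_z2, s_p2, p


def est_true_variance(X):
    """Type-2-unbiased estimate of the true-score variance, last form of (38)."""
    n, z_bar, s_z2, s_p2, _ = summary(X)
    return (n * s_z2 - z_bar * (1 - z_bar) + s_p2) / (n - 1)


def kr20(X):
    """Kuder-Richardson formula 20 computed from item difficulties."""
    n, _, s_z2, _, p = summary(X)
    s_x2 = n * n * s_z2                      # variance of number-correct scores
    return Fraction(n, n - 1) * (1 - sum(q * (1 - q) for q in p) / s_x2)
```

For the five-examinee, four-item matrix used in the test below, the estimated true-score variance is 1/10, the observed variance is 1/8, and both routes give a reliability of 4/5.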


Estimating the Shape of the True-Score Distribution 


An estimated frequency distribution of true scores may be obtained in 
a variety of ways from the moments. If only the first four true-score moments 
have been estimated, Pearson-type curves may be fitted by the formulas 
given in [3]. If more than four moments are available, a Gram-Charlier or an 
Edgeworth series may be fitted ([8], pp. 147-156).

The frequency distribution may be fitted to the estimated raw moments
that are readily obtained from (35). Better results will presumably be
obtained, however, by fitting the distribution to the estimated central 
moments obtained by the method of the immediately preceding section. 


Estimating the True Score of Each Examinee 


Suppose that the psychologist is continually testing many groups of
examinees with randomly parallel forms of some test. Suppose that he makes
an estimate, $\hat{\zeta}_a$, of the true score of each examinee. Over a period of time he
wishes to minimize the sum of squares of the discrepancies between his
estimates and the true values; that is, he wishes to minimize $E_{ia}(\hat{\zeta}_a - \zeta_a)^2$. (The
problem here is basically similar to that discussed by Robbins in ([12], p.
161), but there are important differences and the method of estimation used
here is different from his.)


Linear Estimation of True Score from Observed Score 


In the simplest case, it is assumed that the estimate is a linear function
of observed score. In this case, the problem is to minimize

  $S_1 = E_{ia}(Bz_a + C - \zeta_a)^2$

by a suitable choice of B and C, which are assumed to be functions of the
true scores, not of the observed scores.

Since in type-2 sampling $z_a$ is actually a sufficient estimator of $\zeta_a$, it
might be thought that $S_1$ would be minimized by setting B = 1 and C = 0.
This is not the case, however.

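The necessary derivatives are

(41)  $\frac{\partial S_1}{\partial B} = 2E_{ia}z_a(Bz_a + C - \zeta_a)$,

(42)  $\frac{\partial S_1}{\partial C} = 2E_{ia}(Bz_a + C - \zeta_a)$.

Set (42) equal to zero, take expected values, and solve for C to obtain

(43)  $C = \bar{\zeta}(1 - B)$.

Substitute (43) in (41) to obtain

(44)  $(E_{ia}z_a^2 - \bar{\zeta}E_{ia}z_a)B = E_{ia}z_a\zeta_a - \bar{\zeta}E_{ia}z_a$.

By (28),

(45)  $E_{ia}z_a^2 = \frac{1}{n^2}E_{ia}(x_a^{[2]} + x_a) = \frac{n^{[2]}}{n^2}\mu_2' + \frac{1}{n}\mu_1'$.

Since $E_{ia}z_a\zeta_a = E_a\zeta_aE_iz_a = \mu_2'$, (44) becomes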
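
(46)  $B = \frac{\mu_2' - \mu_1'^2}{\frac{n-1}{n}\mu_2' + \frac{1}{n}\mu_1' - \mu_1'^2}$
      $= \frac{n\sigma_\zeta^2}{(n-1)\sigma_\zeta^2 + \bar{\zeta}(1 - \bar{\zeta})}$
      $= \frac{n}{n-1}\left[1 - \frac{\bar{\zeta}(1 - \bar{\zeta})}{(n-1)\sigma_\zeta^2 + \bar{\zeta}(1 - \bar{\zeta})}\right]$.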

Since the population quantities $\bar{\zeta}$ and $\sigma_\zeta^2$ in (46) are unknown, they must be
replaced by sample estimates. These will have a type-2 sampling variance of order 1/n, and will
therefore probably differ from the true values by an amount of order $1/\sqrt{n}$.
Such a substitution into the last expression in (46) will change the value of
B only by $O(n^{-3/2})$, because of the n - 1 in the denominator of the second
fraction. If unbiased estimates from (35), (36), and (38) are used, the
estimated value of B is found to be

(47)  $\hat{B} = \frac{n}{n-1}\left[1 - \frac{\bar{z}(1 - \bar{z}) + s_p^2/(n-1)}{ns_z^2 + ns_p^2/(n-1)}\right]$.

This quantity differs from the Kuder-Richardson formula-21 reliability
coefficient ([4], p. 225)

(48)  $r_{21} = \frac{n}{n-1}\left[1 - \frac{\bar{z}(1 - \bar{z})}{ns_z^2}\right]$

only by quantities of order $1/n^2$:

  $\frac{n}{n-1}\left[1 - \frac{\bar{z}(1 - \bar{z}) + s_p^2/(n-1)}{ns_z^2 + ns_p^2/(n-1)}\right] - r_{21} = \frac{s_p^2[\bar{z}(1 - \bar{z}) - s_z^2]}{(n-1)s_z^2[(n-1)s_z^2 + s_p^2]} = O\Big(\frac{1}{n^2}\Big)$.

Consequently, $r_{21}$ may be used as an estimator for B. It should be noted that
$r_{21}$ and $r_{20}$ differ by $O(1/n)$. Hence $r_{20}$ is not an adequate estimator of B in the
present situation, since $r_{21}$ also differs from 1.00 by $O(1/n)$.
The linear equation for estimating an examinee’s true score from his
observed score is thus seen to be

(49)  $n\hat{\zeta}_a = \bar{x} + B(x_a - \bar{x})$  or  $\hat{\zeta}_a = \bar{z} + B(z_a - \bar{z})$.

This last equation is the same as (1) except that B now appears in place of $r_{zz}$.
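
The linear estimation procedure can be sketched as follows. The Python below assumes that the summary statistics (test length n, mean proportion-score, score variance, and item-difficulty variance) have already been computed; the function names are illustrative, and the closed-form gap between the estimated B and formula 21 is an algebraic consequence of the two bracketed expressions rather than a quotation from the article.

```python
from fractions import Fraction


def b_hat(n, z_bar, s_z2, s_p2):
    """Estimated regression coefficient B built from unbiased moment estimates."""
    numerator = z_bar * (1 - z_bar) + s_p2 / (n - 1)
    denominator = n * s_z2 + n * s_p2 / (n - 1)
    return Fraction(n, n - 1) * (1 - numerator / denominator)


def kr21(n, z_bar, s_z2):
    """Kuder-Richardson formula 21 written for proportion-correct scores."""
    return Fraction(n, n - 1) * (1 - z_bar * (1 - z_bar) / (n * s_z2))


def gap(n, z_bar, s_z2, s_p2):
    """Closed form of b_hat minus kr21; of order 1/n^2."""
    return (s_p2 * (z_bar * (1 - z_bar) - s_z2)
            / ((n - 1) * s_z2 * ((n - 1) * s_z2 + s_p2)))


def estimate_true_score(z_a, z_bar, B):
    """Linear least squares estimate of an examinee's true proportion-score."""
    return z_bar + B * (z_a - z_bar)
```

With n = 4, a mean of 1/2, a score variance of 1/8, and an item-difficulty variance of 1/20, the estimated B is 12/17 and formula 21 gives 2/3; an examinee with observed score 1 would then be assigned an estimated true score of 5/6 when formula 21 is used for B.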


Curvilinear Estimation of True Score from Observed Score 


A virtue of the present approach is that it can be extended to the case 
where the regression of true score on observed score is curvilinear. This case 
is likely to arise when the true-score distribution in the group tested is 
substantially different from the usual symmetrical bell-shaped form. This 
situation will arise not only whenever the group of examinees is a peculiar 
one, but also whenever the test administered is too hard or too easy for the 


group. 

Figure 1

Regression (heavy line) of true score ($\zeta$) on obtained score (z) when true score has
a dichotomous frequency distribution


Figure 1 presents an extreme example to illustrate why a rectilinear
relation cannot always be expected. The situation in Figure 1 is that half of
the group have true scores at $\zeta_0$ and half at $\zeta_1$. The conditional frequency
distribution of the observed scores for each of the two true-score values is
indicated by a bell-shaped curve. Since the average error of measurement is
zero, the mean of each conditional frequency distribution lies on the
45-degree broken line, which is the regression of observed score on true score.
It is graphically evident from the figure that the other regression, the
regression of true score on observed score, must lie very close to the heavy
ogive-shaped line and must be very sharply curvilinear.

If the curvilinear regression of true score on observed score can be
approximated by an rth degree polynomial in $z_a$, the equation of this curvilinear
regression can be obtained by finding the values of the B’s that will minimize
$S_r = E_{ia}(B_0 + B_1z_a + B_2z_a^2 + \cdots + B_rz_a^r - \zeta_a)^2$ and by writing the
regression equation as

(50)  $\hat{\zeta}_a = \hat{B}_0 + \hat{B}_1z_a + \hat{B}_2z_a^2 + \cdots + \hat{B}_rz_a^r$,

where the $\hat{B}$’s are sample estimates of the B’s.

When the partial derivative of $S_r$ with respect to each B is set equal to
zero, there results a set of r + 1 linear equations in r + 1 unknowns which are basically
the same as the “normal” equations used to determine curvilinear regression
coefficients in conventional situations ([8], eq. 22.20; [11], pp. 429-431). For
example, when r = 3, the equations found by differentiating $S_3$ are

(51)
  $E_{ia}\zeta_a = B_0 + B_1E_{ia}z_a + B_2E_{ia}z_a^2 + B_3E_{ia}z_a^3$,
  $E_{ia}z_a\zeta_a = B_0E_{ia}z_a + B_1E_{ia}z_a^2 + B_2E_{ia}z_a^3 + B_3E_{ia}z_a^4$,
  $E_{ia}z_a^2\zeta_a = B_0E_{ia}z_a^2 + B_1E_{ia}z_a^3 + B_2E_{ia}z_a^4 + B_3E_{ia}z_a^5$,
  $E_{ia}z_a^3\zeta_a = B_0E_{ia}z_a^3 + B_1E_{ia}z_a^4 + B_2E_{ia}z_a^5 + B_3E_{ia}z_a^6$.
In order to evaluate $E_a z_a^r$, it is first convenient to express $z_a^r$ as
a sum of the polynomials $z_a^{[r']}$ (r' = 1, 2, ..., r), as in (45). The
expected value of each of these polynomials is then readily obtained from (28).
Thus,

$$
\begin{aligned}
E_a z_a^2 &= \frac{1}{n^2}\bigl(n^{[2]}\mu'_2 + n\mu'_1\bigr),\\
E_a z_a^3 &= \frac{1}{n^3} E_a\bigl(z_a^{[3]} + 3z_a^{[2]} + z_a^{[1]}\bigr)
           = \frac{1}{n^3}\bigl(n^{[3]}\mu'_3 + 3n^{[2]}\mu'_2 + n\mu'_1\bigr),\\
E_a z_a^4 &= \frac{1}{n^4}\bigl(n^{[4]}\mu'_4 + 6n^{[3]}\mu'_3 + 7n^{[2]}\mu'_2 + n\mu'_1\bigr),\\
E_a z_a^5 &= \frac{1}{n^5}\bigl(n^{[5]}\mu'_5 + 10n^{[4]}\mu'_4 + 25n^{[3]}\mu'_3 + 15n^{[2]}\mu'_2 + n\mu'_1\bigr),\\
E_a z_a^6 &= \frac{1}{n^6}\bigl(n^{[6]}\mu'_6 + 15n^{[5]}\mu'_5 + 65n^{[4]}\mu'_4
             + 90n^{[3]}\mu'_3 + 31n^{[2]}\mu'_2 + n\mu'_1\bigr).
\end{aligned}
\tag{52}
$$

The numerical coefficients on the right are the same as those obtained when
an ordinary raw moment is expressed as a sum of factorial moments
([8], eq. 3.19).
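The coefficients in (52) can be checked numerically. The following is only a sketch, under the assumption that the observed score is a binomial proportion given the true score (so that the factorial moments of the number-right score are $n^{[k]}\mu'_k$); the function names are illustrative and do not come from the paper. It generates the coefficients as Stirling numbers of the second kind and compares the resulting moments with a direct binomial calculation for a single fixed true score.

```python
from fractions import Fraction
from math import comb


def stirling2(r, k):
    """Stirling number of the second kind S(r, k), by the usual recurrence."""
    if r == k:
        return 1
    if k == 0 or k > r:
        return 0
    return k * stirling2(r - 1, k) + stirling2(r - 1, k - 1)


def falling(n, k):
    """Factorial power n^[k] = n(n-1)...(n-k+1)."""
    out = 1
    for j in range(k):
        out *= n - j
    return out


def raw_moment_z(r, n, mu):
    """E z^r = n^(-r) * sum_k S(r,k) n^[k] mu'_k, where mu[k] = E zeta^k."""
    total = sum(stirling2(r, k) * falling(n, k) * mu[k] for k in range(1, r + 1))
    return Fraction(total) / n ** r


def binomial_moment_z(r, n, p):
    """Direct E (x/n)^r for x ~ Binomial(n, p); used only as a cross-check."""
    return sum(Fraction(x, n) ** r * comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(n + 1))


# Coefficients appearing in the fourth- and sixth-moment lines of (52),
# listed from mu'_1 upward.
print([stirling2(4, k) for k in range(1, 5)])
print([stirling2(6, k) for k in range(1, 7)])
```

With every examinee at the same true score p, the moments $\mu'_k$ reduce to $p^k$, and `raw_moment_z` then agrees exactly with `binomial_moment_z`.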
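Similarly,

$$
\begin{aligned}
E_a \zeta_a z_a &= \mu'_2,\\
E_a \zeta_a z_a^2 &= \frac{1}{n^2}\bigl(n^{[2]}\mu'_3 + n\mu'_2\bigr),\\
E_a \zeta_a z_a^3 &= \frac{1}{n^3}\bigl(n^{[3]}\mu'_4 + 3n^{[2]}\mu'_3 + n\mu'_2\bigr).
\end{aligned}
\tag{53}
$$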
It is helpful to rewrite (51) by making the substitutions $C_0 = nB_0$,
$C_1 = n(B_1 - 1)$, $C_2 = nB_2$, $C_3 = nB_3$. With a little rearranging,
(51) becomes

$$
\begin{aligned}
C_3 E_a z_a^3 + C_2 E_a z_a^2 + C_1 E_a z_a + C_0 &= n(E_a \zeta_a - E_a z_a) = 0,\\
C_3 E_a z_a^4 + C_2 E_a z_a^3 + C_1 E_a z_a^2 + C_0 E_a z_a &= n(E_a z_a \zeta_a - E_a z_a^2),\\
C_3 E_a z_a^5 + C_2 E_a z_a^4 + C_1 E_a z_a^3 + C_0 E_a z_a^2 &= n(E_a z_a^2 \zeta_a - E_a z_a^3),\\
C_3 E_a z_a^6 + C_2 E_a z_a^5 + C_1 E_a z_a^4 + C_0 E_a z_a^3 &= n(E_a z_a^3 \zeta_a - E_a z_a^4).
\end{aligned}
\tag{54}
$$

The reason for preferring (54) to (51) is that three of the four unknowns
(B's) in (51) are quantities $O(n^{-1})$ and the fourth differs from unity by a
quantity $O(n^{-1})$; in (54), all the unknowns are O(1), i.e., are functionally
independent of the value of n except for terms that become small as n becomes
large. Furthermore, when (52) and (53) are substituted into the right side
of (54) all quantities O(n) cancel leaving only quantities O(1).

Adequate approximations to the solutions of (54) may now be obtained
by substituting unbiased estimates for the unknown values of the $\mu'$ in (52)
and (53), substituting the resulting numerical quantities back into (54), and
solving the resulting simultaneous linear equations for $C_0$, $C_1$, $C_2$,
and $C_3$. The exact extent to which terms of higher order in n can be
neglected in this process is hard to state for any given value of n. There are
good indications, however, that terms of order $n^{-1}$ should be retained in
(52) and (53) even when n is as large as 100.
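The procedure just described can be illustrated with a small exact calculation. The sketch below is not taken from the paper: it assumes binomial errors of measurement (which is what the coefficients in (52) and (53) correspond to) and, purely for illustration, a Beta(a, b) distribution of true scores, for which the regression of true score on observed score is known to be exactly linear, $(a + nz)/(a + b + n)$. All function names are hypothetical. Solving (54) with these moments should therefore return $C_2 = C_3 = 0$.

```python
from fractions import Fraction


def stirling2(r, k):
    """Stirling number of the second kind S(r, k)."""
    if r == k:
        return 1
    if k == 0 or k > r:
        return 0
    return k * stirling2(r - 1, k) + stirling2(r - 1, k - 1)


def falling(n, k):
    """Factorial power n^[k]."""
    out = 1
    for j in range(k):
        out *= n - j
    return out


def e_z(r, n, mu, shift=0):
    """E zeta^shift z^r = n^(-r) sum_k S(r,k) n^[k] mu'_(k+shift); cf. (52), (53)."""
    total = sum(stirling2(r, k) * falling(n, k) * mu[k + shift] for k in range(r + 1))
    return Fraction(total) / n ** r


def beta_moments(a, b, top):
    """Raw moments mu'_0..mu'_top of an assumed Beta(a, b) true-score distribution."""
    mu = [Fraction(1)]
    for k in range(1, top + 1):
        mu.append(mu[-1] * Fraction(a + k - 1, a + b + k - 1))
    return mu


def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        piv = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]


def cubic_c(n, mu):
    """Solve the four equations (54) for [C0, C1, C2, C3]."""
    ez = [e_z(r, n, mu) for r in range(7)]
    ezz = [e_z(r, n, mu, shift=1) for r in range(4)]
    A = [[ez[s + t] for t in range(4)] for s in range(4)]
    rhs = [n * (ezz[s] - ez[s + 1]) for s in range(4)]
    return solve(A, rhs)


# Uniform true scores (a = b = 1) on a 10-item test.
print(cubic_c(10, beta_moments(1, 1, 6)))
```

In practice the $\mu'$ would be replaced by unbiased sample estimates, as the text indicates; the Beta moments here only supply a case whose answer is known in advance.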
A STATISTICAL TEST FOR THE SIGNIFICANCE OF A
COEFFICIENT OF REPRODUCIBILITY


Philip C. Sagi
PRINCETON UNIVERSITY


Under the assumption that responses to different dichotomous items 
are statistically independent, exact distributions for coefficients of repro- 
ducibility, CR, are derived. These distributions are useful in testing whether 
an observed coefficient of reproducibility differs significantly from chance 
expectations and whether further scaling manipulations are warranted. The 
effects of purification on sampling distributions of errors are calculated to 
demonstrate a relationship between scaling operations and expected chance 
values of the CR. 


The practice of devising a measuring instrument on the very sample 
that is being used to test a substantive hypothesis involving the measure 
is suspect. The permissive nature of scalogram analysis, the collapsing of 
item categories, the elimination of items if they contribute too much error, 
and the ad hoc determination of positive and negative response categories 
capitalize on the unique and chance features of the sample. The more such 
operations are performed in the process of scaling the more suspect is the 
resultant claim of a scale. At some point evidence should be presented which 
demonstrates (i) that scaling operations were warranted, or (ii) that the 
parallelogram formed by the response patterns to the retained set of ordered 
items is not merely an artifact of the operations performed in the analysis 
and the peculiarities of the sample—but that the items do in fact form a 
scale with the claimed theoretical properties. Such evidence may be gained in 
either of two ways—by replication on a subsequent sample or by the use of 
appropriate tests of statistical significance. Test by replication is neither
convenient nor possible at all times.

One weakness of current scaling practices in sociological and psychological 
research is due to the lack of knowledge about sampling distributions of 
evaluative summary statistics, such as Guttman’s Coefficient of Reproduci- 
bility, CR. The intent here is to present exact distributions and exact tests 
for the CR under certain conditions and to discuss tests of significance for 
other evaluative statistics. This is not an original problem. Guttman provides
an estimate of the upper bound to the standard error of the CR, given certain
assumptions ([3], p. 77). This estimate is based upon an intuitive appeal to
the “sampling distribution of ordinary proportions (which follow the binomial
distribution) especially when the population reproducibility is high (say,
over .90)” ([4], p. 279). Loevinger recognizes the existence of the problem
([6], p. 37). The current practice of reporting the chance value of the CR,
under the assumption of complete independence, implicitly assumes the
existence of a sampling distribution of the CR. Further evidence of the
type mentioned can be readily cited.


Reproducibility, Homogeneity, and the Concept of Error 


Loevinger and Guttman have approached the problem of measurement 
in a manner that emphasizes the patterns of responses to a given set of 
items. In the case of dichotomous items with common content, as judged 
by face validity, items are ordered by their difficulty. The ranking by difficulty 
is equivalent to the ranking of items by frequencies of positive endorsements. 
The pattern of responses to ordered items is then compared to an ideal 
pattern of responses. Loevinger’s Coefficient of Homogeneity and Guttman’s 
CR are summary indices designed to reflect the totality of divergences from 
ideal patterns. If a response pattern for items ordered from “hard” to “easy” 
is + — + + — — +, Loevinger would assign the number 7 as a measure of 
divergence and Guttman would assign the number 3. The numbers 7 and 3 
describe the number of errors that would be associated with the observed 
pattern by the respective systems of measuring divergences. It is clear that 
the definition of error is somewhat arbitrary and is imposed on the data. 
Working rules for counting errors, when dichotomous items are ordered from
hard to easy, are, respectively:

I. Loevinger: the number of (+ −) patterns among all pairs of items;

II. Guttman: the minimum number of substitutions of + for − or − for + in
order to transform the observed pattern into some ideal pattern or scale type.

The two rules have a great deal in common. Each involves the counting
of (+ −) patterns. The rule for Guttman scales may be viewed as the counting
of (+ −) patterns for adjacent items plus double and triple to n-order
error patterns among adjacent items. The (+ + − −) pattern is a double
error of the (+ −) pattern, and the (+ + + − − −) pattern is a triple
error of this (+ −) pattern. The illustrative pattern + − + + − − + is
recognized as composed of a single and a double error pattern, which total
to three errors.

Additional arbitrary rules for counting errors can be readily introduced.
As an example, and because it appears fruitful, consider a rule III and count
as errors the (+ −) patterns among adjacent pairs of ordered items. (Green
[2] anticipated rule III with minor variations. See [8] for additional rules.)
Thus, + − + + − − + would be assigned the number 2 as a measure of
divergence. The observation that this arbitrary rule for counting errors gives
the same results as does the Guttman system of counting errors when the
number of dichotomous items is three or less and agrees closely when the
number of dichotomous items is four, suggests the utility of some form of
four-fold analysis involving each pair of adjacent items.
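The three working rules can be stated compactly in code. The following is a minimal sketch, with response patterns written as strings of "+" and "-" for items ordered from hard to easy; the function names are illustrative only. Under this ordering the ideal patterns are taken to be a run of minuses followed by a run of pluses.

```python
from itertools import product


def loevinger_errors(pattern):
    """Rule I: number of (+ -) patterns among all pairs of items i < j."""
    k = len(pattern)
    return sum(1 for i in range(k) for j in range(i + 1, k)
               if pattern[i] == "+" and pattern[j] == "-")


def guttman_errors(pattern):
    """Rule II: fewest substitutions needed to reach some ideal pattern."""
    k = len(pattern)
    best = k
    for cut in range(k + 1):
        ideal = "-" * cut + "+" * (k - cut)
        best = min(best, sum(a != b for a, b in zip(pattern, ideal)))
    return best


def rule3_errors(pattern):
    """Rule III: number of (+ -) patterns among adjacent items only."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "+" and b == "-")


def rules_agree(k):
    """True if rules II and III give the same count for every pattern of k items."""
    return all(guttman_errors("".join(p)) == rule3_errors("".join(p))
               for p in product("+-", repeat=k))


example = "+-++--+"
print(loevinger_errors(example), guttman_errors(example), rule3_errors(example))
# prints: 7 3 2
print([rules_agree(k) for k in (1, 2, 3, 4)])
```

For four items the two rules part company on patterns such as + + − −, which rule III counts once and the Guttman rule counts twice.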


Exact Distribution of Errors under Rule III 


The expression $p' = (r_1!\,r_2!\,c_1!\,c_2!)/(a!\,b!\,c!\,d!\,n!)$ is well
known and is the probability of the observed four-fold distribution under the
hypothesis of statistical independence. Row and column totals are treated as
fixed and not subject to sampling variation. This is also a problem in
sampling without replacement ([5], pp. 230-231).
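As a small illustration of the expression for p', the sketch below evaluates it in exact rational arithmetic for a four-fold table with cell frequencies a, b, c, d. The function name is hypothetical; the formula is the one quoted above.

```python
from fractions import Fraction
from math import factorial


def fourfold_probability(a, b, c, d):
    """p' = r1! r2! c1! c2! / (a! b! c! d! n!), with all marginal totals fixed."""
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    n = a + b + c + d
    num = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    den = factorial(a) * factorial(b) * factorial(c) * factorial(d) * factorial(n)
    return Fraction(num, den)


# Ten respondents, all four marginal totals equal to five, one (+ -) pattern.
print(fourfold_probability(4, 1, 1, 4))   # prints: 25/252
```

Summing p' over every table compatible with the fixed marginals gives unity, as it must for a sampling-without-replacement scheme.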

Given K items ordered by difficulty from hard to easy, the probability 
of the observed response patterns, assuming statistical independence, for 
each adjacent pair of items is 

$$\prod_{i=1}^{K-1} p'_i .$$

This follows directly from the assumption of independence, where i denotes
the four-fold table formed by the ith and (i + 1)th items, the items being
ordered by difficulty from i = 1, 2, ..., K. Table 1 is an example of the
typical four-fold table formed by adjacent items in which $c_i$ is the
frequency of (+ −) patterns and is therefore the frequency of errors. (In the
case of a perfect scale, the $c_i$ cell contains a zero frequency and therefore
$\sum_{i=1}^{K-1} c_i = 0$.)


TABLE 1
Typical Four-fold Table Formed by Adjacent Items

    Item i + 1             Item i response categories
    response categories        +          −         Total
            +                 a_i        b_i       r_1(i+1)
            −                 c_i        d_i       r_2(i+1)
          Total               c_1i       c_2i         n


In fact

$$\prod_{i=1}^{K-1} p'_i$$

is the probability of exactly $c_1, c_2, \ldots, c_{K-1}$ errors occurring under
the hypothesis of statistical independence. Referring to Table 1 while employing
Fisher's exact test, the probability of obtaining a frequency in the $c_i$ cell
less than or equal to $c_i$ is given by the expression
$p_i = p'_{i0} + p'_{i1} + p'_{i2} + \cdots + p'_{ic_i}$, and the probability of
the total errors being less than or equal to E can be expressed as

$$\text{Prob}\,(c_1 + c_2 + c_3 + \cdots + c_{K-1} \le E) = \sum_{c_i}\Bigl(\prod_{i=1}^{K-1} p'_i\Bigr), \tag{1}$$

where $\sum_{c_i}$ symbolizes the summation for all possible $c_i$ such that

$$\sum_{i=1}^{K-1} c_i \le E.$$
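Formula (1) lends itself to direct computation. The sketch below is one possible reading of it, not the author's own procedure: each adjacent pair contributes a hypergeometric distribution for $c_i$ with its marginals fixed, and the K − 1 distributions are combined as if independent, which is what the product in (1) expresses. The argument `m` lists the frequencies of positive responses to the items in order of difficulty; all names are illustrative.

```python
from fractions import Fraction
from math import comb


def adjacent_error_pmf(m_i, m_next, n):
    """P(c_i = c): (+ -) count for items i and i + 1 with fixed marginals."""
    neg_next = n - m_next                      # negatives on item i + 1
    total = comb(n, neg_next)
    return [Fraction(comb(m_i, c) * comb(n - m_i, neg_next - c), total)
            for c in range(min(m_i, neg_next) + 1)]


def total_error_pmf(m, n):
    """Distribution of E = c_1 + ... + c_(K-1) under rule III, as in (1)."""
    pmf = [Fraction(1)]
    for m_i, m_next in zip(m, m[1:]):
        step = adjacent_error_pmf(m_i, m_next, n)
        new = [Fraction(0)] * (len(pmf) + len(step) - 1)
        for e, p in enumerate(pmf):
            for c, q in enumerate(step):
                new[e + c] += p * q
        pmf = new
    return pmf


def prob_total_at_most(m, n, e_max):
    """Left side of (1): probability that the total error does not exceed e_max."""
    return sum(total_error_pmf(m, n)[: e_max + 1])


# N = 10, K = 4, all item marginals equal to five.
print(float(prob_total_at_most([5, 5, 5, 5], 10, 4)))
```

For N = 10, K = 4, and $m_i$ = 5, four errors or fewer correspond to a CR of .90 or more, and the computed probability rounds to .016606, one of the two values quoted later in the text for Table 3.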


Exact Distribution of Guttman's Coefficient of Reproducibility


Restricting the argument to four dichotomous items, the probability of
exactly $c_1$, $c_2$, $c_3$, G observed errors among four ordered items when
errors are counted in keeping with the rule covering Guttman scales is


, , he a ny —- L ) 
Ip. &( a ~-k L \@A\n,-—a—G 


L=f 
4 3 


where b is equal to the smallest of the three frequencies n; , c, , or m,; — ¢, 
(since G < n;,G < c, and G < m, — ¢,); m, is the frequency of positive 
responses to the ith item; $n_i$ is the frequency of negative responses to the
ith item; G is the number of double error patterns. The product factor in (2) is
identical to the product factor in (1). The remaining factors of (2) may be
viewed as a correction for double errors. Formula (2) is readily apparent if
for each particular array of errors $c_1$, $c_2$, $c_3$, there exists a sampling
distribution of double error patterns distributed in the same manner as (+ −)
patterns for three adjacent items.

The Guttman rule for counting errors introduces an impressive com- 
plexity to the formula previously derived and a still greater complexity 
for Loevinger’s rule. While it is possible to develop similar formulas for K 
items, it is not practical to do so. The practical solution may be to use rule 
III for the measurement of divergences. 

In Tables 2 and 3 exact distributions are presented for errors counted
according to Guttman and rule III. Table 2 compares exact distributions for
the two systems of counting errors given a sample of size 10, four items, and
$m_1$, $m_2$, $m_3$, $m_4$ equal to 2, 4, 6, 8, respectively. Table 3 compares
exact distributions for the two systems of counting errors given a sample size
N = 10, four items (K = 4), and all item marginals equal to five ($m_i$ = 5).
Where marginals are all equal, items cannot be ordered by marginals. However,
we can assume either an a priori ordering or marginals that differ by extremely
small numbers. These parameter values were chosen for illustrative purposes
and ease of computation.

The columns headed cumulative percentages provide the information 
for one-tailed tests of significance for the evaluative statistic CR. Specifically, 
the cumulative percentages indicate the chances of getting a CR as large or 





(2) 
larger than the one observed if the hypothesis that responses to items are
statistically independent is true. Rejection of the null hypothesis in favor
of the alternative hypothesis, the hypothesis that the CR is greater than
values usually observed if the null hypothesis is true, is taken as evidence
that the item response patterns are in the direction of a scale. Thus, assuming
statistical independence and given the parameter values as indicated in
Table 2, values of CR greater than or equal to .90 occur with probability
less than or equal to .6570 or .6354, depending on the system of counting
errors. By definition

$$CR = 1 - \frac{E}{NK}.$$

In Table 3, the probabilities corresponding to a CR of .90 are .016606
and .007069. The differences between the probabilities indicate the similarity
of results between the two systems of counting errors (at least for four-item
scales) and the important effect of the item marginals on the distributions.
In Table 4 the values of E counted in accord with rule III for given probability
levels under the hypothesis of statistical independence are tabulated. The
values of N, K, and $m_i$, respectively, are 100, 4, and 20, 40, 60, 80.
Comparisons between Tables 2 and 4 provide the additional and expected
information that the size of the sample is also an important consideration.

While the shape of the sampling distribution varies with N, K, and
$m_i$ (i = 1, 2, ..., K), the distributions approximate normality when N
becomes large. Goodman [1] has demonstrated this latter point and has
shown the adequacy of the approximation for a small sample problem.
For small samples, the approximation seems most warranted when the $m_i/N$
are all about .5. In the small sample case, radical departures from
approximate normality obtain with extreme item marginals. Exact distributions,
if known over a range of values for N, K, and $m_i$, may delimit the situations
in which Goodman's approximate method yields accurate probability statements.
The reader is referred to [1] for a more adequate treatment of the
large sample problem.


Discussion 


Under the conditions described, exact tests are generally applicable if 
errors are counted according to rule III. For the Guttman rule, we are re- 
stricted to 2, 3, and 4-item scales. There are practical limitations that must 
be considered. When N is large, greater than 10, and K is 4 or greater, the
computational labor involved in the use of the derived formulas becomes
excessive. The utility of the formulas is further limited by the logical
argument that the correct application of the tests occurs prior to
“purification,” the practice of eliminating items which contribute the
greatest numbers of errors. The common practice in scaling is to purify and
then report the evaluative statistic calculated for the purified scale.
Obviously, a test of the purified-scale statistic must take into consideration
the number and marginals of all items prior to elimination. Formulas (1) and
(2) may be used to test whether a set of items is worth purifying and scaling.
The importance of the argument cannot be overstated. Even if the hypothesis of
statistical independence is true for responses to some original K' items, it is
likely that purification will yield a subset of K items (K < K') that appear
to scale.


TABLE 4

Exact Distribution of Errors
(N = 100, m_1 = 20, m_2 = 40, m_3 = 60, m_4 = 80)

    Number of    Relative       Cumulative      CR
    errors       frequencies    percentages
       24         .0000006        .000+        .9400
       25         .0000277        .003         .9375
       26         .0000777        .011         .9350
       27         .0002203        .033         .9325
       28         .0005522        .088         .9300
       29         .0012905        .217         .9275
       30         .0027770        .495         .9250
       31         .0055665       1.051         .9225
       32         .0103488       2.086         .9200
       33         .0179269       3.879         .9175
       34         .0288232       6.761         .9150


TABLE 5

Exact Distributions of E Following Purification
(N = 10, K = 4, m_i = 5)

    Number of      Guttman's system              Rule III
    errors      Hypothetical     Cum.      Hypothetical     Cum.       CR
       E        frequencies       %        frequencies       %
       0                 1       .000               1       .000     1.000
       1                50       .003              75       .004      .975
       2             1,100       .058           2,175       .113      .950
       3            13,625       .739          30,625      1.643      .925
       4            90,100      5.243         217,500     12.516      .900
       5           304,000     20.440         750,000     50.009      .875
       6           871,500     64.007       1,000,000    100.000      .850
       7           720,000    100.000                                 .825

    Total        2,000,376                  2,000,376
Had formulas (1) and (2) been applied prior to purification, the tests of
significance would have indicated, with predetermined probabilities of error,
that purification was or was not justified. Illustrative of this point is the
effect that purification has upon the expected chance value of the CR as
well as the sampling distribution of E.

Assuming statistically independent responses to different ordered
dichotomous items, the expected frequencies, $c_i$, for any pair of adjacent
items may be computed by the formula $m_i n_{i+1}/N$. If, among other equally
arbitrary standards, the criterion of purification employed is the elimination
of an item or items yielding observed frequencies $c_i > m_i n_{i+1}/N$, the
remaining items must exhibit frequencies $c_i \le m_i n_{i+1}/N$. Further, if it
is assumed that the original pool of items is sufficiently large, it is
inevitable that some subset of K items will satisfy the purification criterion.
Exact distributions for such purified scales are presented in Table 5.
Distributions were calculated from (1) and (2) subject to the restriction
$c_i \le m_i n_{i+1}/N$. Purification tends to increase the magnitude of the
expected CR under the hypothesis of statistical independence.

The expected CR’s following purification are .866 and .848 as compared 
to the unpurified scale statistics of .812 and .797, respectively. Expected 
chance values of CR that are commonly reported do not correspond to the 
operations performed on data and actually underestimate the true expected 
chance value of the CR. Moreover, applications of formulas (1) and (2) to 
purified sets of items increase the chances of falsely rejecting null hypotheses. 
Had some other criterion of purification been invoked, a different distribution 
of errors would have followed. 
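The effect of purification on the rule III distribution can be traced with a short calculation. The sketch below assumes that the restriction amounts to truncating each adjacent-pair hypergeometric distribution at $m_i n_{i+1}/N$ and combining the truncated weights as in (1); this is an interpretation of the text, and the function names are illustrative. Under that reading the unnormalized weights play the role of the "hypothetical frequencies" of Table 5.

```python
from math import comb


def truncated_weights(m_i, m_next, n):
    """Hypergeometric weights for c_i, kept only while c_i <= m_i * n_(i+1) / N."""
    neg_next = n - m_next
    limit = m_i * neg_next / n
    return [comb(m_i, c) * comb(n - m_i, neg_next - c)
            for c in range(min(m_i, neg_next) + 1) if c <= limit]


def purified_frequencies(m, n):
    """Unnormalized distribution of total rule III errors after purification."""
    freq = [1]
    for m_i, m_next in zip(m, m[1:]):
        w = truncated_weights(m_i, m_next, n)
        new = [0] * (len(freq) + len(w) - 1)
        for e, f in enumerate(freq):
            for c, g in enumerate(w):
                new[e + c] += f * g
        freq = new
    return freq


def expected_cr(freq, n, k):
    """Expected CR = 1 - E(E)/(NK) for a frequency distribution of errors."""
    mean_e = sum(e * f for e, f in enumerate(freq)) / sum(freq)
    return 1 - mean_e / (n * k)


freq = purified_frequencies([5, 5, 5, 5], 10)
print(freq, sum(freq))
print(round(expected_cr(freq, 10, 4), 3))   # prints: 0.866
```

For N = 10, K = 4, and $m_i$ = 5 this yields the rule III column of Table 5 and an expected CR of .866, against .8125 when no truncation is applied (an expected 2.5 errors for each of the three adjacent pairs).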

Other limitations on the use of these formulas also involve operations 
commonly associated with the scaling process. The stipulation that positive 
and negative response categories be determined by face validity prior to the 
inspection of inter-item relationships is not unlike the restriction that formulas 
(1) and (2) are appropriately used prior to purification. The ad hoc procedure 
of determining sign by examining the direction of association with other items 
under consideration also tends to result in observed $c_i$ being less than or
equal to expected $c_i$ frequencies. Similarly, the practice of collapsing
item-response
categories to form contrived dichotomous items is not an admissible operation 
if such collapsing is done on the basis of observed inter-item response patterns.* 

The conditions under which tests of significance apply define prior 
admissible scaling operations. Conversely, different scaling operations usually 
imply different sampling distributions of the evaluative statistic. The impli- 
cation is immediately apparent: either scaling operations are restricted to 
those operations consistent with developed statistical tests, or extensive 
replication is used as a substitute testing procedure. Tests proposed in this
paper and in [1] merely serve to establish whether responses to a set of ordered
dichotomous items are worthy of further manipulation. The complete
rationalization of the scaling process requires a test designed to
differentiate between artifact and true improvement of the CR following
scaling manipulations.


*Interestingly enough, such ad hoc collapsing exploits any lack of an isotropic
relationship between adjacent items, a situation that is inimical to the concept of a scale.
SIMPLE STATISTICAL METHODS FOR SCALOGRAM ANALYSIS*


Leo A. Goodman†
UNIVERSITY OF CHICAGO


Simple statistical methods are developed to test whether coefficients of 
reproducibility, of homogeneity, or of consistency differ significantly from 
what can be expected if responses to different items are statistically inde- 
pendent. Simple methods are also developed for estimating the variance of 
coefficients of reproducibility when it is not assumed that responses to dif- 
ferent items are independent. These estimates are used to test whether a co- 
efficient differs significantly from any prescribed value, and also to obtain 
confidence intervals for these coefficients. The rationale for the measure- 
ment of reproducibility is also discussed. 


The desirability and importance of developing statistical methods in 
scalogram analysis for the comparison of an observed coefficient (of re- 
producibility) with its expected value, estimated under the assumption that 
responses to different items are statistically independent, has been pointed 
out in [4] and [7]. In the following three sections such statistical methods are 
developed for the following coefficients: (a) Loevinger’s coefficient of homo- 
geneity [10], (b) Green’s coefficient of consistency [6], (c) a coefficient of 
reproducibility given by Green [6], and (d) a coefficient of reproducibility 
given by Sagi [13]. These methods can be used to test whether each of these 
coefficients is significantly different from what can be expected if responses 
to different items are statistically independent. Of course, these methods are 
to be applied to a given coefficient only in situations where it seems appropri- 
ate to use that coefficient [cf. 6, 10, 11, 13]. These methods can be used to 
determine “whether a set of items is worth purifying and scaling” [cf. 13],
and should not be applied, except with caution, to a set of items that
have been purified.

Although the comparison of an observed coefficient with its expected 
value, estimated under the assumption that responses to different items are 
independent, is useful for some purposes, this particular kind of comparison 
may be neither very informative nor useful for other purposes. It may be 
more or less obvious that the responses are not independent and that a 


*This research was carried out at the Statistical Research Center, University of 
Chicago, under sponsorship of the Statistics Branch, Office of Naval Research, and of the 
Social Science Research Committee, University of Chicago. Reproduction in whole or in 
coefficient does differ significantly from what can be expected if independence
were assumed. It may be more worth while to compare the observed data 
with some model other than that obtained by the assumption of independence. 
The problem of testing whether a coefficient differs significantly from what 
can be expected under the assumption of independence may not be identical 
with the problem of testing whether the observed data indicate that there is, 
in some sense, a scale. Also, the problem of measuring the size of the difference 
between an observed coefficient and what can be expected in the case of 
independence may not be identical with the problem of measuring, in some 
sense, how scalable the observed data are. In the present paper these problems 
are discussed briefly, and methods are presented for estimating the variance 
of Guttman’s coefficient, or of a number of other coefficients, of reproduci- 
bility [cf. 13] when the assumption of independence is not made. These 
methods lead to the development of tests of whether an observed coefficient 
of reproducibility is significantly different from any prescribed value. Also 
described are methods of obtaining confidence intervals for coefficients of 
reproducibility that can be used for the kinds of situations discussed in 
[6, 10, 13]. For the sake of completeness, the situation in which there is a 
random sample of items also is considered. 

Finally, a numerical comparison is presented between the method 
suggested in the following section for testing whether Guttman’s coefficient 
differs significantly from what can be expected if independence is assumed 
and the method suggested in [13] for performing this test. One of the serious
limitations of the method suggested in [13], which was mentioned by the
author, is that “when N (the number of respondents in the sample) is large,
greater than 10, and K (the number of items) is 4 or greater, the labor of
computing ... becomes excessive.” The method described here is very much
simpler to use than that given in [13], and it will be appropriate when the
number of respondents in the sample (i.e., the sample size) N is large. It can
be used without any real difficulty, for any number K (either small, moderate,
or large) of items.

The methods presented in the present paper are approximations in the 
sense that they depend on certain asymptotic results, i.e., as N approaches 
infinity the methods become less approximate. Many other statistical methods 
are also approximations in this same sense. The numerical comparisons suggest 
that, even when the sample size is quite small, the methods presented lead to 
rather good approximations in some particular cases. A modification of these 
methods, which leads to still better approximations for small samples, is also 
suggested. It would be valuable to compare the various approximate methods 
presented here in many other particular cases in order to determine how 
large a sample must be drawn for these methods to lead to useful approxi- 
mations; this would be worth while with regard to the approximate methods 
presented here as well as to many other statistical methods (cf. [2], p. 328). 
A Coefficient of Reproducibility


The coefficients given in [6, 11, and 13] are all defined for an ordered set of 
K items in which there are only two kinds of responses to each item. In these 
articles it was assumed that the ordering of the K items can be obtained from 
a priori knowledge [cf. 11, 13], or be determined directly by arranging “the
items in rank order according to their popularities” [6]. It also will be assumed
here either that the ordering of the items is known a priori, or that the total
number of respondents sampled is sufficiently large so that almost certainly
the rank order according to popularities will be the same as that obtained
if the total population of respondents were studied. In the latter case, it is
assumed that the rank ordering according to popularities obtained for the
total population is unambiguous (i.e., that the “popularities” of two different
items will not be exactly equal for the total population) and that this rank
ordering corresponds to the “true” ordering of the K items. The situation 
where there are more than two kinds of responses to each item was com- 
mented upon in [6, 7, and 9], and it will not be discussed here. 

For K = 5, the response pattern (+ − + + −) indicates that the respondent
gives positive responses to items 1, 3, 4, and a negative response to
items 2, 5. The following response patterns would be ideal, indicating a
“perfect scale” [cf. 4]: (− − − − −), (− − − − +), (− − − + +),
(− − + + +), (− + + + +), (+ + + + +). An observed response pattern
may differ from the ideal patterns; there are several ways of measuring this
“error.”

In [13], the total error e' is obtained by comparing item i with item
i + 1, counting the number of (+ −) patterns observed for these adjacent
items (i.e., the number $f_{i,\overline{i+1}}$ of individuals who answered
positively on item i and negatively on item i + 1), and summing these counts
for all i; i.e.,

$$e' = \sum_{i=1}^{K-1} f_{i,\overline{i+1}}.$$

For K ≤ 3, this method of computing the total error is identical with
Guttman's method [cf. 13]. Since the coefficient of reproducibility is based
directly on the observed total error, i.e., c' = 1 − (e'/NK), a method for
testing whether the observed e' differs significantly from what can be expected
when the responses to different items are independent will be given first. If
the number of individuals who answered item i positively is $f_i$, and the
number of negative responses on item i is $N - f_i = \bar{f}_i$, then the
expected value (given $f_i$ and $f_{i+1}$) of $f_{i,\overline{i+1}}$, under the
assumption of independence, is $f_i \bar{f}_{i+1}/N$. Also if the assumption of
independence is true, using the usual approach to the problem of testing the
null hypothesis of independence in a 2 × 2 contingency table (or the null
hypothesis that the expected value of the difference between two proportions,
obtained from independent samples, is zero), it can be seen [cf. 12] that the
distribution of the statistic
$[f_{i,\overline{i+1}} - (f_i \bar{f}_{i+1}/N)]^2 / (f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/N^3)$
will be approximately a $\chi^2$ distribution with one degree of freedom, as N
approaches infinity. Furthermore, the distribution of the statistic
$[f_{i,\overline{i+1}} - (f_i \bar{f}_{i+1}/N)] / (f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/N^3)^{1/2}$
will be approximately a normal distribution with zero mean and unit variance,
if the assumption of independence is true, and N → ∞ (cf. [12], p. 71). It can
also be seen that, if the assumption of independence is true, the distribution
of the statistic

$$\frac{\sum_{i=1}^{K-1}\bigl[f_{i,\overline{i+1}} - (f_i \bar{f}_{i+1}/N)\bigr]}
       {\Bigl(\sum_{i=1}^{K-1} f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/N^3\Bigr)^{1/2}}$$

will be approximately a normal distribution with zero mean and unit variance
(see Appendix). Thus, the null hypothesis of independence of response can
be tested by seeing whether e' differs significantly from

$$E' = \sum_{i=1}^{K-1} f_i \bar{f}_{i+1}/N;$$

i.e., by computing $(e' - E')/S_{e'} = z'$, where

$$S_{e'}^2 = \sum_{i=1}^{K-1} f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/N^3,$$

and comparing the numerical value of z' with the standard unit normal
distribution. For example, an approximate one-sided test, at the 95 per cent
level of significance, can be obtained by rejecting the null hypothesis
whenever z' < −1.645.

Since c' = 1 − (e'/NK), $z' = -(c' - C')/S_{c'}$, where C' = 1 − (E'/NK)
and $S_{c'} = S_{e'}/NK$. Thus, an approximate one-sided test, at the 95 per cent
level of significance, can be obtained if the null hypothesis is rejected whenever
$c' > C' + (1.645)S_{c'}$.
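The test just described requires only the item marginals and the adjacent (+ −) counts. A minimal sketch follows; the function name and the illustrative counts are assumptions, not data from the paper, although the marginals are those of Table 4 of [13].

```python
from math import sqrt


def adjacent_error_test(f, adjacent_errors, n):
    """Return (E', S_e', z', c', C') for the coefficient based on e'.

    f               -- positive-response counts f_1, ..., f_K, items in order
    adjacent_errors -- observed counts of (+ -) patterns for items i and i + 1
    n               -- number of respondents N
    """
    k = len(f)
    fbar = [n - fi for fi in f]
    e_obs = sum(adjacent_errors)
    e_exp = sum(f[i] * fbar[i + 1] for i in range(k - 1)) / n
    s_e = sqrt(sum(f[i] * fbar[i] * f[i + 1] * fbar[i + 1]
                   for i in range(k - 1)) / n ** 3)
    z = (e_obs - e_exp) / s_e
    c_obs = 1 - e_obs / (n * k)
    c_exp = 1 - e_exp / (n * k)
    return e_exp, s_e, z, c_obs, c_exp


# Hypothetical data: N = 100, marginals 20, 40, 60, 80, and 34 adjacent errors.
e_exp, s_e, z, c_obs, c_exp = adjacent_error_test([20, 40, 60, 80], [10, 14, 10], 100)
print(e_exp, round(s_e, 3), round(z, 3), c_obs, c_exp)
```

Here E' = 40 and $S_{e'}^2$ = 13.44, so 34 observed errors give z' of about −1.64, which falls just short of the −1.645 criterion.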


A Coefficient of Homogeneity 
Rather than compute e’ as was done above, consider now the total 
error e’’ obtained by comparing item 7 with item j (¢ < j), counting the 
number of (+ —) patterns observed for these two items (i.e., the number 
f;,; of individuals who responded positively on item 7 and negatively on item 


j), and summing these counts for all 7 and j (¢ < j);i.e., 
K-1 K 


e = Zz, .& ae 


t=1 j=i+1 
The expected value (given the f;) of e’’, when responses to different items 
are independent, is 


K-1 K 


Ev =) do fdi/N. 


i=1 j=it+l 
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Loevinger suggests (H’’ — e’’)/E” as a coefficient h of homogeneity [cf. 10, 
15]. A simple test of the null hypothesis that responses to different items are 
independent is based on determining whether h differs significantly from zero. 

By an approach similar to that presented above (cf. Appendix), it is 
seen that, if the null hypothesis is true, then the distribution of the statistic 
(E” — e”)/S,.. = 2” will be approximately a normal distribution with zero 
mean and unit variance, when N — , and 

K-1 K 


Si. = > ss. (f:F:f:F;/N°). 

Thus, an approximate one-sided test, at level of significance .95, can be 
obtained by rejecting the null hypothesis whenever 2” > 1.645. Also, since 
h is such that hE’’/S,.. = 2’’, an approximate test of the null hypotheses at 
level of significance .95 can be obtained if this hypothesis is rejected whenever 
h > 1.645 S,../E”’. (For this hypothesis, the expected value of h is zero.) 

The method presented here is an approximation in the same sense as is 
the method presented in the earlier section. The results of some numerical 
comparisons presented in the final section tend to suggest that the approxi- 
mate test of whether h differs significantly from zero will be quite useful. 
The modification described in the final section to improve the approximation 
when sample sizes are small or moderate can also be applied to obtain from 
$S_{e''}$ a modified statistic $S^*_{e''}$.


Another Coefficient of Reproducibility 


Now consider one of the coefficients, $\mathrm{Rep}_B$, in [6], suggested as an estimate
of reproducibility. (The author states that this coefficient was checked
empirically, and that the average discrepancy between this coefficient and
the "correct" coefficient of reproducibility was .003 [cf. 6].) For this coefficient,
the total error $e'''$ is obtained as

$$e''' = \sum_{i=1}^{K-1} f_{i,i+1} + \sum_{i=2}^{K-2} (f_{i,i+2}\, f_{i-1,i+1})/N ,$$

and $c''' = 1 - (e'''/NK)$. The expected value (given the $f_i$) of $e'''$, under the
assumption that responses to different items are independent, is

$$E''' = \frac{\sum_{i=1}^{K-1} f_i \bar{f}_{i+1}}{N} + \frac{\sum_{i=2}^{K-2} f_i \bar{f}_{i+2}\, f_{i-1} \bar{f}_{i+1}}{N^3} .$$

A simple test of the null hypothesis that responses are independent is to
determine whether $e'''$ differs significantly from $E'''$.

By an approach similar to that presented above, it can be seen that, if
the hypothesis of independence is true, then the distribution of the statistic
$(E''' - e''')/S_{e'''} = z'''$ will be approximately normal in form with zero
mean and unit variance when $N \to \infty$, and

$$S_{e'''}^2 = S_{e'}^2 + \Big\{ \sum_{i=2}^{K-3} \big[ S_{i,i+2}^2 (E_{i-1,i+1} + E_{i+1,i+3})^2 \big] + S_{1,3}^2 E_{2,4}^2 + S_{K-2,K}^2 E_{K-3,K-1}^2 \Big\} / N^2 ,$$

where

$$E_{i,j} = f_i \bar{f}_j / N \quad \text{and} \quad S_{i,j}^2 = f_i \bar{f}_i f_j \bar{f}_j / N^3 .$$

(This result is derived in the Appendix.) Thus, an approximate one-sided
test, at level of significance .95, can be obtained by rejecting the null
hypothesis whenever $z''' > 1.645$. Also, since $c''' = 1 - (e'''/NK)$,
$z''' = (c''' - C''')/S_{c'''}$, where $C''' = 1 - (E'''/NK)$, and $S_{c'''} = S_{e'''}/NK$.
Hence, an approximate one-sided test, at the level of significance .95, can be
obtained if the null hypothesis is rejected whenever $c''' > C''' + (1.645)S_{c'''}$.

Green [6] also suggests an index of consistency $g = (c''' - C''')/(1 - C''')$.
Since $g(1 - C''')/S_{c'''} = z'''$, an approximate test of the hypothesis of
independence at level of significance .95 can be obtained if this hypothesis is
rejected whenever $g > 1.645\, S_{c'''}/(1 - C''')$. (For this hypothesis, the
expected value of $g$ is zero.)
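
The quantities used in this test can be computed as in the following Python sketch. It is an illustration only: the function name `green_rep_b_test`, the 0/1 coding, and the returned dictionary are assumptions, and $\bar{f}_j$ is taken to be $N - f_j$. Instead of writing out the end terms of the variance formula separately, the sketch accumulates, for each pair $(j, j+2)$, the total multiplier of the deviation of $f_{j,j+2}$; for $K \ge 4$ this should reproduce the same sum.

```python
import math

def green_rep_b_test(responses):
    """Test quantities for e''' from an N x K table of 0/1 responses (K >= 4)."""
    n, k = len(responses), len(responses[0])
    f = [sum(row[i] for row in responses) for i in range(k)]  # positives per item
    fbar = [n - fi for fi in f]                               # negatives per item

    def count(i, j):   # observed f_{i,j}: positive on item i, negative on item j
        return sum(1 for row in responses if row[i] == 1 and row[j] == 0)

    def expect(i, j):  # E_{i,j} = f_i * fbar_j / N
        return f[i] * fbar[j] / n

    def s2(i, j):      # S_{i,j}^2 = f_i * fbar_i * f_j * fbar_j / N^3
        return f[i] * fbar[i] * f[j] * fbar[j] / n ** 3

    e3 = sum(count(i, i + 1) for i in range(k - 1))
    big_e3 = sum(expect(i, i + 1) for i in range(k - 1))
    var = sum(s2(i, i + 1) for i in range(k - 1))

    coef = [0.0] * (k - 2)     # coef[j] multiplies the deviation of f_{j,j+2}
    for i in range(1, k - 2):  # 0-based i here is i = 2, ..., K-2 in the text
        e3 += count(i, i + 2) * count(i - 1, i + 1) / n
        big_e3 += expect(i, i + 2) * expect(i - 1, i + 1) / n
        coef[i] += expect(i - 1, i + 1)
        coef[i - 1] += expect(i, i + 2)
    var += sum(s2(j, j + 2) * coef[j] ** 2 for j in range(k - 2)) / n ** 2

    c3 = 1 - e3 / (n * k)          # observed coefficient c'''
    big_c3 = 1 - big_e3 / (n * k)  # its expected value C''' under independence
    return {
        "e": e3, "E": big_e3, "c": c3, "C": big_c3,
        "z": (big_e3 - e3) / math.sqrt(var),  # compare with 1.645
        "g": (c3 - big_c3) / (1 - big_c3),    # index of consistency
    }
```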


Some Different Problems and Methods 


In the preceding sections, all of the formulas for the expected values
(the $E$'s) and the $S^2$'s were computed under the assumption that responses
to different items are statistically independent, and the large sample
distribution theory results also were obtained under this assumption. Now this
assumption will not be made, and a somewhat different problem will be
studied. The methods presented in this section can be applied to Guttman's
coefficient of reproducibility, as well as to a number of other coefficients,
and to the problem of testing whether the expected value of a given
coefficient is some prescribed numerical value. Also, these methods can be used
to obtain approximate confidence intervals for the expected value of the
coefficient.

Consider the situation where there is an ordered set of items, with only
two kinds of responses to each item. Assume that a sample of respondents
was drawn at random from an infinite population or, for a finite population,
that sampling was done with replacement. Since there are $2^K$ different possible
patterns of responses, the total population could conceivably be divided up
into $2^K$ different categories corresponding to these patterns. To each category,
i.e., to each pattern of responses, a number is assigned indicating how different
that pattern is from an ideal one; e.g., to pattern (+ + - - -) the number
1 is assigned for the method in [13], and the number 2 is assigned for Guttman's
method [cf. 13]. The latter method is referred to as Guttman's method in
[13]; this title is also used here although it may differ somewhat from the
methods in [7] and [8]. The different methods lead to different coefficients of
reproducibility. The number assigned to an ideal pattern, e.g., (- - - + +),
is zero. Hence, for any given method of assigning numbers, the $t$th pattern,
$t = 1, 2, \ldots, 2^K$, will be assigned a number $X_t$, and the population
coefficient of reproducibility is

$$C = 1 - \frac{\sum_t D_t X_t}{NK} ,$$

where $D_t/N = P_t$ is the proportion of the population in category $t$. The
coefficient in [13] and also the coefficient obtained using Guttman's method
are both of this general form; Green's coefficient $\mathrm{Rep}_B$, i.e., $c'''$, differs
somewhat from this form, but his coefficient $\mathrm{Rep}_A$ is also of this general form
[cf. 6, 13]. All of the coefficients mentioned here of this general form have
the property $X_t \le K/2$ for all $t$; e.g., the maximum value of $X_t$, for these
coefficients, occurs when the pattern is of the kind (+ - + - + -), where
$X_t = K/2$ (for $K$ even). Thus, $Y_t = X_t/K \le 1/2$, and

$$C = 1 - \frac{\sum_t D_t Y_t}{N} = 1 - \sum_t P_t Y_t .$$

The form of the coefficient W, presented by Jardine in [9], is quite 
different from the general form described here. The coefficient W measures a 
different aspect of the data, although this coefficient can be used to develop 
still another test of the null hypothesis discussed in the preceding sections, 
i.e., that responses to the different items are statistically independent [cf. 
9]. Jardine mentions that "a test for W in the non-null case is not available."
In the present section, tests in the "non-null case" for coefficients of the
general form described above are presented.

If $C$ is understood to be a measure of scalability (or a definition of the
amount of scalability), then it is clear that the system of numbers $X_t$ (or
the $Y_t$) is a basic part of this definition. For different systems of $Y_t$, different
definitions of scalability are obtained. Which system (i.e., which measure
or definition) will be appropriate depends on the particular context to which
the measure is to be applied (cf. discussion of measures of association in [5]).
For example, in some contexts $C^*$, based on using $Y_t = 1/2$ for all categories
$t$ that do not correspond to ideal patterns, will be appropriate. This measure
is closely related to the proportion of the population falling into all categories
that correspond to ideal patterns, and $C^*$ will be useful if the context is such
that the important factor is this proportion. If prediction of a response
pattern for an individual is to be based on his total score, and on the ideal
pattern corresponding to that total score for the set of items, then the
proportion of correct predictions (of response patterns) for the population
will be $2C^* - 1$. Thus, for some purposes $2C^* - 1$ (or $C^*$) measures an aspect
of the reproducibility of the response patterns from the total scores and the
set of ideal patterns.

In some contexts, it may be preferable not to use the collection of K + 1 
ideal patterns as the possible predicted patterns, but rather some other 
collection of patterns as the predicted ones. Thus, sometimes the proportion 
of correct predictions of response patterns for the population can be increased 
by using the K + 1 patterns that occur most frequently in the population as 
the collection of possible predicted patterns. Also, for some purposes, this 
collection need not consist of K + 1 patterns; perhaps some larger number 
would be worth while. Reproducibility in the sense described above can 
sometimes be increased if the number of possible predicted patterns is in- 
creased. Hence, reproducibility in this sense can be high in situations where 
C* or some other measure of scalability is low. It might have been preferable 
to use the term coefficient of scalability for the coefficients of reproducibility 
given in [6, 7, and 13]. However, the term reproducibility does suggest an 
operational concept that is worthy of emphasis [cf. 5]. In some contexts, the 
proportion of item responses, rather than the proportion of response patterns 
predicted correctly, will be important. For these situations C* would not be 
an appropriate measure, but some other coefficients might be [cf. 7, 8, 15].

An estimate of the coefficient of reproducibility in the case where a
random sample of $N$ respondents is observed would be

$$c = 1 - \frac{\sum_t d_t Y_t}{N} ,$$

where $d_t/N$ is the proportion of the sample in category $t$. The expected value
of $c$ is $C$, and the variance of $c$ is

$$\sigma_c^2 = \frac{\sum_t P_t Y_t^2 - (\sum_t P_t Y_t)^2}{N} .$$

Since $Y_t \le 1/2$,

$$\sigma_c^2 \le \frac{\sum_t P_t Y_t/2 - (\sum_t P_t Y_t)^2}{N} = \frac{(1 - C)/2 - (1 - C)^2}{N} = \frac{(1 - C)(C - \tfrac{1}{2})}{N} = \bar{\sigma}^2 .$$

This gives an upper bound $\bar{\sigma}^2$ for the exact variance $\sigma_c^2$ of the coefficient of
reproducibility. For some coefficients (i.e., for some assignments of numbers
$Y_t$), the exact variance $\sigma_c^2$ will be very close to the upper bound $\bar{\sigma}^2$ (e.g., for
$c^*$, $\sigma_c^2$ will be equal to $\bar{\sigma}^2$). For other coefficients, the bound $\bar{\sigma}^2$ may be much
larger than $\sigma_c^2$. For any coefficient, $\sigma_c^2$ will be equal to $\bar{\sigma}^2$ if $P_t = 0$ for all
categories $t$ where the assigned numbers $Y_t$ differ from 0 or 1/2. When $\sigma_c^2 > 0$,
using the usual central limit theorem, the distribution of $z = (c - C)/\sigma_c$ will
be approximately normal with zero mean and unit variance, when $N \to \infty$.
Also $\hat{z} = (c - C)/\hat{\sigma}_c$, where

$$\hat{\sigma}_c^2 = \frac{\sum_t d_t Y_t^2 - (\sum_t d_t Y_t)^2/N}{N^2} ,$$

will be approximately normal with zero mean and unit variance. Furthermore,
$\bar{z} = (c - C)/\bar{\sigma}$ and $\hat{\bar{z}} = (c - C)/\hat{\bar{\sigma}}$, where
$\hat{\bar{\sigma}}^2 = (1 - c)(c - 1/2)/N$, will be approximately normal with zero mean
and variance $\le 1$. Thus, if the null hypothesis is that $C$ is a specified value,
then an approximate one-sided test, at level of significance .95, can be obtained
by rejecting the null hypothesis whenever $c > C + (1.645)\hat{\sigma}_c$. A simpler
approximate (and conservative) one-sided test, at level of significance $\le .95$,
can be obtained by rejecting the null hypothesis whenever
$c > C + (1.645)\hat{\bar{\sigma}}$. For some purposes, a two-sided test may be
appropriate; the general method described herein also leads directly to these
two-sided tests. Using this method, it is also possible to obtain one- or
two-sided confidence intervals for $C$. Thus the interval
$c - (1.96)\hat{\sigma}_c < C < c + (1.96)\hat{\sigma}_c$ is an approximate 95 per cent
confidence interval for $C$. Also, the interval
$c - (1.96)\hat{\bar{\sigma}} < C < c + (1.96)\hat{\bar{\sigma}}$
is an approximate (conservative) confidence interval at a level $\le .95$.
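
The two confidence intervals just described can be illustrated with a short Python sketch. The function name `reproducibility_intervals` and its input format (one value $Y_t = X_t/K$ per respondent, each between 0 and 1/2) are assumptions made for this illustration and do not appear in the original.

```python
import math

def reproducibility_intervals(y, z=1.96):
    """Return c and two approximate intervals for C.

    y holds, for each of the N sampled respondents, the number Y_t
    attached to that respondent's response pattern (0 <= Y_t <= 1/2).
    """
    n = len(y)
    total = sum(y)
    c = 1 - total / n

    # Estimated variance of c: [sum d_t Y_t^2 - (sum d_t Y_t)^2 / N] / N^2
    var_hat = (sum(v * v for v in y) - total ** 2 / n) / n ** 2
    # Estimated upper bound: (1 - c)(c - 1/2) / N; clipped at zero for rounding
    var_bound = max((1 - c) * (c - 0.5) / n, 0.0)

    half = z * math.sqrt(max(var_hat, 0.0))
    half_bound = z * math.sqrt(var_bound)
    return c, (c - half, c + half), (c - half_bound, c + half_bound)
```

When every $Y_t$ in the sample is either 0 or 1/2, the two intervals coincide, which matches the remark above that the bound is attained for $c^*$.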

The preceding comments pertain to the situation where a random sample
of individuals is drawn, and the numerical value of $X_t$ can be determined for
the $t$th response pattern, $t = 1, 2, \ldots, 2^K$. For a fixed, i.e., given, response
pattern $t$, the number $X_t$ is determined by the definition of scalability used,
and this number is considered fixed. In some situations, it may be worth while
to consider this number as a random variable $x_t$, rather than as a fixed value
$X_t$. This will be the case if $X_t$ is the number assigned to the $t$th category
defined for a given population of $K$ items, and $x_t$ is the number obtained for
this category when a random sample of $k$ items is drawn from this population
of $K$ items ($k < K$), and the response patterns to these sample items are
used to compute $x_t$. The case where there is a random sample of items will be
discussed for the sake of completeness, although in many situations it may
not be appropriate to consider the actual set of items used as a random sample
from a larger population.

In this case, if the coefficient is computed on the basis of the observed
response patterns to the $k$ sample items, then this coefficient will in general
differ from the corresponding coefficient for the population of $K$ items. In
the special situation where the $K$ items form a perfect scale, the coefficients
would be identical. In fact, the expected value of the coefficient based on $k$
items, for all possible samples of $k$ items, will generally differ from the
coefficient based on $K$ items. For example, in the extreme case where a
population consists of $K = 4$ items, and where (+ - + -) is the only response
pattern, then the population value of $2C^* - 1 = 0$, but this measure computed
from a random sample of two items drawn from the 4 will take on the values 0
or 1, depending on which two items are drawn. The expected value of this
measure will be .5 when items are sampled without replacement. In this
extreme case, $C'$, $C^*$, and Guttman's coefficient will all take on the values .5 or
1 depending on which two items are drawn; the expected value of these
coefficients will be .75, while the population coefficient is .5. Thus, the coefficients
of reproducibility computed for a sample of items from a larger population of
items must be interpreted with great caution. They generally do not lead to
unbiased estimates of the corresponding coefficients for the larger population,
and they can give quite misleading results.
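
The extreme case above is small enough to enumerate. The following Python sketch lists the six possible two-item samples from the four items and averages $2C^* - 1$ and $C^*$ over them. The 0/1 coding and the helper names `is_ideal` and `c_star` are assumptions introduced only for this illustration.

```python
from itertools import combinations

population_pattern = (1, 0, 1, 0)  # (+ - + -), coded 1 = positive, 0 = negative

def is_ideal(pattern):
    """A pattern is ideal when no positive response precedes a negative one."""
    return all(not (a == 1 and b == 0) for a, b in zip(pattern, pattern[1:]))

def c_star(pattern):
    """C* when every respondent shows this pattern: Y = 0 if ideal, else 1/2."""
    return 1.0 if is_ideal(pattern) else 0.5

subsets = list(combinations(range(4), 2))  # the six possible two-item samples
samples = [tuple(population_pattern[i] for i in idx) for idx in subsets]

mean_correct = sum(2 * c_star(s) - 1 for s in samples) / len(samples)
mean_c_star = sum(c_star(s) for s in samples) / len(samples)
population_c_star = c_star(population_pattern)
```

Under these assumptions `mean_correct` and `mean_c_star` are the averages over item samples, to be compared with the population values $2C^* - 1 = 0$ and $C^* = .5$ quoted in the text.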

Now consider the set of $K$ ordered items that are used in an empirical
study as a random set of ordered items; i.e., there is a population of $M$ sets,
each containing $K$ ordered items, and one of these sets of items is randomly
selected for the empirical study. Also consider the situation where the
responses of a given individual $j$ to the $M$ sets of items are obtained, and
$1 - \sum_t {}_jR_t Y_t = C_j$ is computed for the population of sets of items for this
individual $j$, where ${}_jR_t$ is the proportion of sets among the $M$ sets where the
response pattern for individual $j$ corresponds to the $t$th category, for
$t = 1, 2, \ldots, 2^K$. Thus, in an empirical study where a single set of $K$ items is
used, the numerical value $c_j = 1 - {}_jY$ corresponding to the $j$th individual in the
population, where ${}_jY = {}_jX/K$ is the number associated with the particular
response pattern observed for the $j$th individual, can be considered an estimate
of $C_j$ for that individual.

This $c_j$ will be an unbiased estimate of $C_j$ and the variance of this estimate
is $\sum_t {}_jR_t Y_t^2 - (\sum_t {}_jR_t Y_t)^2 = {}_j\sigma^2$. If $c_j$ is computed for each individual
in a given population of $N$ individuals, then $\sum_j c_j/N$ will be equal to the
coefficient of reproducibility $c$ computed for a single set of $K$ items for a
population of $N$ individuals. Also, the expected value of $c$ is $\sum_j C_j/N = C$,
which would be the coefficient of reproducibility computed on the basis of
all $M$ sets of $K$ items and all individuals in the population. The variance of $c$ is

$$\sigma_c^2 = \sum_{m=1}^{M} (c^{(m)} - C)^2/M ,$$

where $c^{(m)}$ denotes the coefficient of reproducibility computed for the $m$th
set ($m = 1, 2, \ldots, M$) of ordered items in the population of $M$ sets of items.
In the special situation in which, for each respondent in the population, a
set of items is drawn at random from the population of sets (or if the
probability of response pattern $t$ for individual $j$ is ${}_jR_t$, and this is statistically
independent of the response pattern obtained for a different individual), then
the variance of $c$ will be somewhat different; viz.,

$$\sigma_c^2 = \sum_j {}_j\sigma^2/N^2 = \overline{\sigma^2}/N ,$$

where $\overline{\sigma^2}$ is the average value of ${}_j\sigma^2$ for the total population.

For the case in which a random sample of $N$ respondents is drawn from
a large population, the coefficient of reproducibility $c$ computed for the sample
responses to a single set of items will be an unbiased estimate of $C$; the variance
of $c$ will be $\bar{\sigma}_{(\cdot)}^2 + (\sigma_C^2/N)$, where $\sigma_C^2$ is the variance of $C_j$ for the total
population of respondents, $\bar{\sigma}_{(\cdot)}^2$ is the average of

$$\sum_{m=1}^{M} (c_{(s)}^{(m)} - C_{(s)})^2/M = \sigma_{(s)}^2$$

for all samples ($s = 1, 2, \ldots$) of size $N$, and $c_{(s)}^{(m)}$ and $C_{(s)}$ are the
corresponding values of $c^{(m)}$ and $C$ computed for the $s$th sample, by considering
this sample as a population of size $N$. Also, in the special case where the
probability of response pattern $t$ for individual $j$ is ${}_jR_t$, and this probability
is statistically independent of the response pattern for a different individual,
then the variance of $c$ will be $(\overline{\sigma^2} + \sigma_C^2)/N$. In this case, if it can be assumed
also that ${}_jR_t = R_t$, thus, $C_j = C$ and ${}_j\sigma^2 = \sigma^2$ for all individuals $j = 1, 2, \ldots$,
then the expected value of the coefficient of reproducibility $c$, computed for
a single set of $K$ items for a random sample of $N$ individuals, is $C$; the variance
of $c$ is $\sigma^2/N = \sigma_c^2$. The latter was computed earlier in this section; an upper
bound for $\sigma_c^2$ was also presented.


Some Numerical Comparisons 


In [13] the exact distribution of e’ is presented for some special cases. 
Now the results obtained using these exact distributions will be compared 
with those obtained using the approximate methods presented in earlier 
sections. 

As seen in Table 4 in [13], if $N = 100$, $f_1 = 20$, $f_2 = 40$, $f_3 = 60$, $f_4 = 80$,
then the null hypothesis that responses to different items are statistically
independent would be rejected at the level of significance $\ge .95$ if $e' \le 33$.
At the level of significance $\ge .975$, rejection of the null hypothesis would
occur if $e' \le 32$. At the level of significance $\ge .99$, rejection would occur if
$e' \le 30$.

The approximate method described in an earlier section indicates that,
at level of significance $\ge .95$, the null hypothesis should be rejected if
$z' \le -1.645$; i.e., if $e' \le 40 - (1.645)(3.67) = 33.96$. Thus, in this case, the
approximate test and the exact test lead to the same critical region,
$e' = 0, 1, 2, \ldots, 33$. At level of significance $\ge .975$, rejection would occur if
$e' \le 40 - (1.960)(3.67) = 32.81$, i.e., if $e' \le 32$; at level of significance $\ge .99$,
rejection would occur if $e' \le 40 - (2.326)(3.67) = 31.46$, i.e., if $e' \le 31$.
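
The arithmetic of the approximate critical values for this example can be retraced with a short Python sketch. The function name `approx_critical_value` is hypothetical, and $\bar{f}_i$ is taken to be $N - f_i$, an assumption that reproduces the values 40 and 3.67 used above.

```python
import math

def approx_critical_value(f, n, z):
    """Return (E', S_e', largest integer e' in the approximate critical region)."""
    fbar = [n - x for x in f]
    pairs = range(len(f) - 1)  # adjacent item pairs (i, i+1)
    expected = sum(f[i] * fbar[i + 1] for i in pairs) / n
    s = math.sqrt(sum(f[i] * fbar[i] * f[i + 1] * fbar[i + 1] for i in pairs) / n ** 3)
    return expected, s, math.floor(expected - z * s)

f = [20, 40, 60, 80]
results = {z: approx_critical_value(f, 100, z) for z in (1.645, 1.960, 2.326)}
```

Each entry of `results` holds the expected error, its standard error, and the integer cut-off for one of the three levels considered in the text.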

From the preceding numerical computations, it is noted that the only
difference between the exact method and the simpler approximate method
is obtained at the level of significance $\ge .99$. Actually, this difference at the
$\ge .99$ level is due in part to the fact that the exact level of significance of
the critical region $e' \le 30$ is $1 - .00495 = .995$, while the exact level of
significance of the critical region $e' \le 31$ is $1 - .011051 = .989$ [cf. 13]. Thus,
the approximate method leads to quite an accurate approximation in the
particular case considered even at level of significance $\ge .99$. In this case,
the method is conservative in the sense that it will lead to rejection of the
null hypothesis whenever the exact method would have led to rejection, but
it may also lead to rejection in some cases where the exact method would not
(cf. [14], p. 105).

From Table 3 in [13], if $N = 10$, $f_i = 5$ for $i = 1, 2, 3, 4$, then the null
hypothesis would be rejected, at level of significance $\ge .95$, if $e' \le 4$. At
level of significance $\ge .99$, rejection would occur if $e' \le 3$. The approximate
method leads to rejection, at level of significance $\ge .95$, if
$e' \le 7.5 - (1.645)(1.37) = 5.24$, i.e., $e' \le 5$; also, at level of significance
$\ge .99$, the approximate method leads to rejection if
$e' \le 7.5 - (2.326)(1.37) = 4.31$, i.e., $e' \le 4$.
Thus, in this particular case where $N = 10$, the critical regions obtained by
this method differed by one unit from the critical regions obtained using the
exact distribution. Here, too, differences between the results obtained by the
approximate and exact methods are due, in part, to the discontinuous
nature of the exact distribution (cf. [2], p. 331). In this case, too, the
approximate method leads to a conservative test.

For medium and rather small samples, a modification of the approximate
methods (changing an $N$ to an $N - 1$) may lead to improved approximations.
This modification is based on the fact that, when the null hypothesis is true,
the statistic

$$\frac{f_{i,i+1} - (f_i \bar{f}_{i+1}/N)}{\{f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/[N^2(N - 1)]\}^{1/2}}$$

will be approximately a unit normal variate (cf. [12], p. 71), and $(e' - E')/S^*_{e'}$
will also be approximately a unit normal, where

$$S^*_{e'} = \Big\{ \sum_{i=1}^{K-1} f_i \bar{f}_i f_{i+1} \bar{f}_{i+1}/[N^2(N - 1)] \Big\}^{1/2} .$$

Thus, for the preceding example, $S^*_{e'} = [3(625)/(1000 - 100)]^{1/2} = 1.44$, and
this modified method leads to rejection of the null hypothesis, at level of
significance $\ge .95$, if $e' \le 7.5 - (1.645)(1.44) = 5.13$, i.e., $e' \le 5$. Also, at
level of significance $\ge .99$, this method leads to rejection if
$e' \le 7.5 - (2.326)(1.44) = 4.15$, i.e., $e' \le 4$. Thus, a slight improvement is
obtained, although it is not sufficient to change the critical regions given
earlier using $S_{e'}$ rather than $S^*_{e'}$.
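
For the example with $N = 10$, the effect of replacing one factor $N$ by $N - 1$ can be checked directly. The Python sketch below computes both standard errors and the resulting cut-offs; the variable names are illustrative only, and $\bar{f}_i$ is again taken to be $N - f_i$.

```python
import math

n = 10
f = [5, 5, 5, 5]
fbar = [n - x for x in f]
pairs = range(len(f) - 1)  # adjacent item pairs (i, i+1)

expected = sum(f[i] * fbar[i + 1] for i in pairs) / n           # E'
products = sum(f[i] * fbar[i] * f[i + 1] * fbar[i + 1] for i in pairs)

s_plain = math.sqrt(products / n ** 3)                 # unmodified standard error
s_modified = math.sqrt(products / (n ** 2 * (n - 1)))  # N changed to N - 1

# Largest integer e' in each approximate critical region
cutoffs_plain = [math.floor(expected - z * s_plain) for z in (1.645, 2.326)]
cutoffs_modified = [math.floor(expected - z * s_modified) for z in (1.645, 2.326)]
```

Both lists of cut-offs come out the same here, in line with the remark that the modification is not sufficient to change the critical regions in this example.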

Since the approximate methods given here are based, in part, on the
same kinds of asymptotic results as the usual approach to the problem of
testing independence in a 2 X 2 contingency table, the reader is referred to
[2] for discussion and recommendations concerning the 2 X 2 contingency
table and also for some references to tables of exact distributions that can
be of some use in computing the exact distribution of $e'$ when $N$ and $K$ are
small. The reader will note that the modification presented in this section
to improve the approximate methods is not related to the usual Yates
correction (cf. [2], p. 332). A modification related to the Yates correction is not
recommended because, for the 2 X 2 contingency table, this correction "has a
tendency to over-correct" and this "over-correction mounts up" [2], when
the corrected statistics for each of a number of 2 X 2 tables are combined.

It has been shown that the approximate method for testing the
hypothesis that responses to different items are independent, and its modified
form given here, lead to quite useful results in two cases. If $N$ is sufficiently
large to permit the application of the usual $\chi^2$ test for independence to each
of the $(K - 1)$ contingency tables obtained by comparing responses for
items $i$ and $i + 1$ ($i = 1, 2, \ldots, K - 1$), then the approximate method, or
its modified form, leads to useful results.


Appendix 
When the responses to different items are statistically independent,
the statistic $(E''' - e''')/S_{e'''} = z'''$ can be shown to be approximately
normally distributed with zero mean and unit variance when $N \to \infty$. A
similar approach can be used to prove corresponding results for the statistics
$z''$ and $z'$.

The statistic $e''' - E'''$ can be written as

$$e''' - E''' = \sum_{i=1}^{K-1} (f_{i,i+1} - E_{i,i+1}) + \sum_{i=2}^{K-2} (f_{i,i+2} f_{i-1,i+1} - E_{i,i+2} E_{i-1,i+1})/N$$

$$= \sum_{i=1}^{K-1} A_{i,i+1} + \sum_{i=2}^{K-2} (f_{i,i+2} A_{i-1,i+1} + E_{i-1,i+1} A_{i,i+2})/N ,$$

where $A_{i,j} = f_{i,j} - E_{i,j}$. The expected value (given the numbers $f_i$) of
$e''' - E'''$ will be zero, since the expected value of $A_{i,j}$ will be zero, and the
statistics $f_{i,i+2}$ and $A_{i-1,i+1}$ will be statistically independent (given the $f_i$).
$(e''' - E''')N^{-1/2} = w$ may also be written as

$$w = N^{-1/2} \Big[ \sum_{i=1}^{K-1} A_{i,i+1} + \sum_{i=2}^{K-2} (E_{i,i+2} A_{i-1,i+1} + E_{i-1,i+1} A_{i,i+2} + A_{i,i+2} A_{i-1,i+1})/N \Big] .$$

Each $A_{i,j} N^{-1/2} = v_{i,j}$ will be asymptotically normally distributed with zero
mean and an asymptotic variance (for fixed numbers $f_i/N$) of
$S_{i,j}^2/N = f_i \bar{f}_i f_j \bar{f}_j/N^4$ (cf. [12], p. 71; [3], p. 366). Thus,

$$w = \sum_{i=1}^{K-1} v_{i,i+1} + \sum_{i=2}^{K-2} (E_{i,i+2} v_{i-1,i+1} + E_{i-1,i+1} v_{i,i+2} + v_{i,i+2} v_{i-1,i+1} N^{1/2})/N$$

$$= \sum_{i=1}^{K-1} v_{i,i+1} + \Big[ v_{1,3} E_{2,4} + v_{K-2,K} E_{K-3,K-1} + \sum_{i=2}^{K-3} v_{i,i+2} (E_{i-1,i+1} + E_{i+1,i+3}) \Big]/N + \delta ,$$

where

$$\delta = \sum_{i=2}^{K-2} v_{i,i+2} v_{i-1,i+1}/N^{1/2} .$$

Since $\delta$ will converge in probability to zero, the statistic $w$ is essentially a
linear combination of the $v_{i,j}$'s (cf. [3], p. 254). The $v_{i,j}$'s are asymptotically
independent; this follows from the assumption of independence of the
responses, or it could also be proved using the methods given in [1]. Hence
the statistic $w$ is essentially a linear combination of asymptotically normally
distributed independent random variables, and this statistic will also be
asymptotically normally distributed. The asymptotic variance (for fixed
$f_i/N$) of this statistic will be $S_{e'''}^2/N$, where

$$S_{e'''}^2 = \sum_{i=1}^{K-1} S_{i,i+1}^2 + \Big[ S_{1,3}^2 E_{2,4}^2 + S_{K-2,K}^2 E_{K-3,K-1}^2 + \sum_{i=2}^{K-3} S_{i,i+2}^2 (E_{i-1,i+1} + E_{i+1,i+3})^2 \Big]/N^2 .$$


REFERENCES 


[1] Anderson, T. W. and Goodman, L. A. Statistical inference about Markov chains. 
Ann. math. Statist., 1957, 28, 89-110. 

[2] Cochran, W. G. The $\chi^2$ test of goodness of fit. Ann. math. Statist., 1952, 23, 315-345.

[3] Cramér, H. Mathematical statistics. Princeton: Princeton Univ. Press, 1946.

[4] Festinger, L. The treatment of qualitative data by scale analysis. Psychol. Bull., 1947, 
44, 149-161. 

[5] Goodman, L. A. and Kruskal, W. H. Measures of association for cross classifications. 
J. Amer. statist. Ass., 1954, 49, 732-764. 

[6] Green, B. F. A method of scalogram analysis using summary statistics. Psychometrika, 
1956, 21, 179-188. 

[7] Guttman, L. The basis for scalogram analysis. In S. A. Stouffer et al., Measurement
and prediction. Princeton: Princeton Univ. Press, 1950. Pp. 60-90. 

[8] Guttman, L. On Festinger’s evaluation of scale analysis. Psychol. Bull., 1947, 44, 
451-465. 

[9] Jardine, R. Ranking methods and the measurement of attitudes. J. Amer. statist.
Ass., 1958, 53, 720-728. 

[10] Loevinger, J. A systematic approach to the construction and evaluation of tests of 
ability. Psychol. Monogr., 1947, 61, No. 4. 

[11] Loevinger, J. The technic of homogeneous tests compared with some aspects of “‘scale 
analysis.’ Psychol. Bull., 1948, 45, 507-529. 

[12] Pearson, E. S. and Hartley, H. O. Biometrika tables for statisticians, Vol. 1. Cambridge:
Cambridge Univ. Press, 1954. 

[13] Sagi, P. C. A statistical test for the significance of the coefficient of reproducibility. 
Psychometrika, 1959, 24, 19-27. 

[14] Walker, H. M. and Lev, J. Statistical inference. New York: Holt, 1953. 

[15] White, B. W. and Saltz, E. Measurement of reproducibility. Psychol. Bull., 1957, 
54, 81-99. 


Manuscript received 4/3/58 
Revised manuscript received 6/11/58 














PSYCHOMETRIKA—VOL. 24, No. 1 
MARCH, 1959 


HIERARCHICAL FACTOR SOLUTIONS WITHOUT ROTATION 


ROBERT J. WHERRY


OHIO STATE UNIVERSITY 


A method is presented for securing a hierarchical factor solution which 
achieves simple structure at each hierarchical level without rotation or even 
preliminary arbitrary orthogonal or oblique solutions. The method is based 
upon the assumption that if overlap is removed from clusters the remaining 
specifics will achieve simple structure automatically. The problem presented
earlier by Schmid and Leiman, using oblique simple structural rotation as 
a basis, is reworked by this new approach. 


Schmid and Leiman [1] recently published an article on the develop- 
ment of hierarchical factor solutions. Their method was an extension of a 
method proposed by Thomson [2] a number of years ago. It consisted of 
factoring and rotating the factors at the first-order level to oblique simple 
structure. The factor correlation matrix was then obtained by inverting the 
reference vector correlation matrix and normalizing it. This factor correlation 
matrix was in turn factored and the process was repeated at this higher level. 
This cycle was repeated until a single factor explained the highest level factor 
matrix. A method of back-solution was then provided which yielded a trans- 
formation matrix to be applied to the next lower level oblique simple structure 
loadings for each order involved. 

The author, with the assistance of Dr. B. J. Winer, had worked out this 
identical extension independently. The method was still considered too 
cumbersome and indirect, however, and the search for a more direct solution 
continued. With suggestions from several sources, particularly from Dr. 
Roderick Bare, the author succeeded in eliminating the need for oblique 
cluster loadings, their transformation to orthogonality, the subsequent 
rotation to simple structure, and the need for the reference and factor corre- 
lation matrices. 

The present method starts with the multiple group factor method, and 
obtains the usual cluster sums and the cluster correlations. It is assumed 
that if all overlap is removed from the clusters then they will have simple 
structure with respect to each other. Hence the multiple group method is 
applied immediately to the cluster correlation matrix and a new set of higher 
order cluster sums and cluster correlation matrix is obtained. This process 
is repeated until the final cluster matrix consists of a single cluster. 

It is now assumed that this final cluster represents a general factor 
with loadings on all of the original tests or variables, and that the specifics 
at that level represent sub-general factors possessing simple structure with 
respect to each other and having loadings on all tests contained in the lower
order clusters which compose them. Each cluster, since it represents a postu- 
lated reference vector, is assumed to have unit variance for purposes of 
computing the specific loadings; of course, communalities are used in the 
multiple cluster factorization. The data at this final stage will consist of: 


a. the highest level unitary factor cluster matrix, a set of loadings for that 
factor, and a specific factor loading for each lower order cluster represented 
in the matrix; 

b. one or more lower order analyses for each of which the following are 
available: 

1. the cluster sums for each still lower order cluster (in the lowest order 
these are cluster sums for each original variable), 

2. the multipliers obtained by taking the reciprocal of the square root 
of the sum of cluster sums for the variables in each cluster, which 
were used to obtain the cluster correlation matrix at that level. 


These data are sufficient for obtaining a hierarchical solution identical to 
that achieved by the Schmid-Leiman approach. 

The following steps serve to complete the hierarchical solution. 

(1) Using the highest order cluster correlation matrix with ones in the 
diagonals and the general and specific factor loadings as criteria, obtain the 
beta weights which would predict these factors from the clusters. 

(2) These beta weights are then multiplied by the multipliers referred 
to in b2 above, and these products are used as a transformation matrix. 

(3) These transformation weights are used to multiply the cluster sums 
of the previous analysis, which yields general and sub-general factor loadings 
for each of the next lower order clusters. 

(4) The communality is computed from these factor loadings for each 
cluster at this new level, subtracted from unity, and the square root is taken 
to obtain a specific factor loading for the cluster. 

(5) Steps (1) through (4) are repeated for each successive lower order 
matrix until the original correlation matrix for the variables is involved. 

This method has been successfully employed in both actual factoring of 
correlation matrices and as a modification of the Wherry-Winer [3] method 
for factoring large numbers of items. It has been used in several studies at 
Ohio State University and in several industrial concerns. To illustrate the 
method, it has been applied to the fictitious matrix analyzed by Schmid and 
Leiman. The solution is presented in Tables 1 through 5. 

Table 1 contains the original intercorrelations ($R_{00}$) and communalities
(for this problem assumed to be known). By inspection, each successive pair 
of variables is seen to constitute a separate cluster (,C’;), indicated by spacing 
in the table. Beneath the table are found the six sets of cluster sums (,>_,C’). 
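
The cluster sums are simply column totals taken over the rows that belong to each cluster. A minimal Python sketch, restricted for brevity to the first four variables and the first two clusters, is given here; the values are copied from Table 1 with decimal points restored, and the function name `cluster_sums` is hypothetical.

```python
# Upper-left 4 x 4 block of the correlation matrix (communalities in the diagonal)
r = [
    [0.6400, 0.7200, 0.3136, 0.2688],
    [0.7200, 0.8100, 0.3528, 0.3024],
    [0.3136, 0.3528, 0.4900, 0.4200],
    [0.2688, 0.3024, 0.4200, 0.3600],
]
clusters = {"A": [0, 1], "B": [2, 3]}  # 0-based indices of the member variables

def cluster_sums(matrix, groups):
    """For each cluster, sum each variable's correlations with the cluster's members."""
    n_vars = len(matrix)
    return {
        name: [sum(matrix[m][v] for m in members) for v in range(n_vars)]
        for name, members in groups.items()
    }

sums = cluster_sums(r, clusters)
```

Applying the same function to the full 12 x 12 matrix with all six clusters would give the six rows of cluster sums shown beneath the table.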

Table 2 gives the sums of the first-order cluster sums (,)>) >> C,C.), 
the reciprocals of the square roots of the major diagonal terms (,M,,,), the 
TABLE 1

Intercorrelations of 12 Variables with Communalities in the
Diagonals, and Six Sets of Cluster Sums for Each Variable

Intercorrelations, $R_{00}$

Cluster         A              B              C              D              E              F
Variable     1      2       3      4       5      6       7      8       9     10      11     12
    1      6400   7200    3136   2688    0983   0491    1290   0369    2903   1613    0645   0753
    2      7200   8100    3528   3024    1106   0553    1452   0415    3266   1814    0726   0847
    3      3136   3528    4900   4200    0753   0377    0988   0282    2222   1235    0494   0576
    4      2688   3024    4200   3600    0645   0323    0847   0242    1905   1058    0424   0494
    5      0983   1106    0753   0645    6400   3200    1344   0384    1089   0605    0242   0282
    6      0491   0553    0377   0323    3200   1600    0672   0192    0544   0302    0121   0141
    7      1290   1452    0988   0847    1344   0672    4900   1400    1429   0794    0318   0370
    8      0369   0415    0282   0242    0384   0192    1400   0400    0408   0227    0091   0106
    9      2903   3266    2222   1905    1089   0544    1429   0408    8100   4500    1458   1701
   10      1613   1814    1235   1058    0605   0302    0794   0227    4500   2500    0810   0945
   11      0645   0726    0494   0424    0242   0121    0318   0091    1458   0810    3600   4200
   12      0753   0847    0576   0494    0282   0141    0370   0106    1701   0945    4200   4900

Cluster Sums*

    A    1.3600 1.5300    6664   5712    2089   1044    2742   0784    6169   3427    1371   1600
    B      5824   6552    9100   7800    1398   0700    1835   0524    4127   2293    0918   1070
    C      1474   1659    1130   0968    9600   4800    2016   0576    1633   0907    0363   0423
    D      1659   1867    1270   1089    1728   0864    6300   1800    1837   1021    0409   0476
    E      4516   5080    3457   2963    1694   0846    2223   0635  1.2600   7000    2268   2646
    F      1398   1573    1070   0918    0524   0262    0688   0197    3159   1755    7800   9100

* Sums for variables in each cluster are underlined.



cluster intercorrelations (R11) with communalities in the diagonals. These
communalities were computed by a modified Spearman equation

    $\bar{r}_{ij}\,\bar{r}_{ik} / \bar{r}_{jk}$,

where $\bar{r}_{ij}$ and $\bar{r}_{ik}$ represent the average correlation of variables in the cluster
with all variables not in the cluster. Inspection showed this matrix to contain
three clusters, indicated by spacing in the table. Beneath the intercorrelation 
table are the three rows of second-order cluster sums Gr. C.C;-). 

Table 3 gives the sums of the second-order cluster sums (>, >, C,C;-), 
the reciprocals of the square roots of the major diagonal terms (,M¢,), the 
second-order cluster correlations (R..), with communalities computed as 
above. This matrix forms a single cluster, and so this table is followed by a 
row of sums, the actual factor loadings, and the appropriate specific loadings. 
This concludes the first or forward phase of the solution. 

Table 4 shows the second or back-solution phase in some
detail. It contains a Doolittle solution using the second-order cluster
intercorrelations from Table 3 with ones in the diagonal, and with the loadings
from that same table used as successive criteria. The beta weights for each
criterion, obtained by the usual back-solution method, are then postmultiplied
by a diagonal matrix of multiplier values (M), also found in Table 3, to
obtain a transformation matrix (T2). Finally the transpose of this
transformation matrix is used to postmultiply the transpose of the second-order
cluster sums from Table 2, which yields the factor structure for the
second-order domain. The general factor is labeled Z and the three sub-general
factors are labeled S1, S2, and S3.

Table 5 presents some of these same steps for the transformation into 
the first-order domain with loadings for the original variables. The first-order 
transformation matrix (T1), obtained by a Doolittle solution using first-order
cluster intercorrelations, first-order cluster loadings (including specifics)
in the second-order domain as criteria, and the diagonal multiplier matrix, 
is shown. In addition the final loadings on the General, the three sub-generals, 
and the six group factors are obtained for the twelve original variables by 
postmultiplying the transpose of the matrix of first-order cluster sums for 
each test by the transpose of the first-order transformation matrix. 


Discussion of Results 


Inspection of the final factor loadings in Table 5 discloses that the six 
group factors possess simple structure with respect to one another and that 
the three sub-general factors have this same property among themselves. 
Comparison with the final results obtained by Schmid and Leiman shows no 
discrepancy greater than plus or minus .0001. 

It has been demonstrated that successive applications of the multiple
group factor method to the original correlation table and to the resulting
cluster correlation tables at higher orders will, after proper back solution,
yield results identical to those of the Schmid-Leiman approach.

The chief virtue of the present method is that it completely eliminates 
the rotation problem by using cluster correlations directly rather than (1) 
rotating to oblique simple structure at each stage and (2) using the inverse 
of the reference vector matrix to go to the next higher order. It even eliminates 
the need for the original oblique cluster correlations, and the rotation of these 
to orthogonality prior to such rotation, since it uses the cluster sums directly. 

The hierarchical solution, being completely orthogonal, is easily used 
to get the residual table. It also presents all factors from all “orders” in a 
single table so that the whole picture can be seen at a glance. Its group 
factors are comparable to the Thurstone primary factors except that the 
present factors are orthogonal. Its sub-general factors bear the same
relationship to Thurstone’s “second-order” factors, and the general factor,
in this instance, to a Thurstone “third-order” factor. Thus the interpretation
of the factors would be unchanged, with possibly better interpretation of
the higher order factors since the projections of all of the original tests are
now explicit on all orders of factors rather than merely on the first-order
factors. Since the two types of solution are mathematically equivalent, the
question of whether factors are “really” oblique or orthogonal is unanswered.
A generalization of the Bush-Mosteller learning model is proposed in
connection with two-alternative learning situations with continuous
reinforcement. The problem is to test the hypothesis that reward and
non-reward are equally effective in promoting learning. Statistically this reduces
to testing a hypothesis about the value of a single parameter $\theta$, while a set
of other parameters remains unspecified. The test presented has the property
of being asymptotically locally most powerful among all tests of the same
size and asymptotically similar. The application of the test is illustrated.


This paper is concerned with testing the effectiveness of reward in 
simple learning situations of the following type. A subject is observed over a 
sequence of trials, on each of which he must choose between two mutually 
exclusive responses, A and B. When he responds by choosing A he receives 
a reward, and when he chooses B he either receives no reward or is punished. 
Experiments indicate that, whatever choices the subject makes on successive
trials, he eventually learns to make response A. Both a rewarded and an
unrewarded response may contribute to the learning process, but the two 
experiences may differ in their effectiveness towards that end. The problem 
is to determine from a set of experimental data whether reward or non-reward 
has a greater effect on the subject’s learning of that response. 

A typical experiment in this class would be running a hungry rat through
a simple T-maze. After going through a short runway, the rat has to choose
between turning right (response A) or left (response B); the right branch 
leads to a box containing food (reward), the left to an empty box (non- 
reward). These same conditions—the hunger drive and the outcomes of the 
two responses—prevail on all trials. 

One can also set up the maze with food on one branch and an electric 
shock (punishment) on the other. Or, to take a possible variation out of many, 
one might compare the effectiveness of different degrees of reward by placing 
a large amount of food in one box and a small amount in the other. 

In some of these experiments, there may be doubt as to the exact nature 
of the particular response to be learned by the subject, whether, for instance, 
a hungry rat learns to avoid one box because of an electric shock in it or to
seek the other because it contains food. Such distinctions are not made here; 
rather, responses A and B are defined in terms of the subject’s overt behavior, 
then the effects of the outcomes of A and B are examined. In the case of the 
hungry rat finding a food box on turning right and an electric shock on turning 
left, a right turn is classified as response A and a left turn as response B. 
The question is whether food or shock is more effective in encouraging the 
rat to make response A on future trials. If a satisfactory answer to this 
question is obtained, it may then be desirable to redefine responses A and B. 
But the problem of interpretation is left strictly within the domain of the 
experimenter. 

Learning situations of this nature were among those treated by Bush 
and Mosteller [1] in the application of their basic model for learning. The 
model presented and used in this paper was actually suggested by the Bush- 
Mosteller approach, but allows for more generality in the trial-to-trial 
transitions. 


The Model 
Assumptions 


The two-alternative learning situation described above consists of a 
sequence of trials on each of which the subject makes a choice between two 
responses, A (the response to be learned) and B. Consider the behavior of 
this subject at the start of trial j in an experiment. While several factors 
may influence his decision in choosing between A and B, they all combine 
to give him some degree of preference for one response over the other. This 
preference may be expressed in terms of the probability with which he
chooses one of the two responses. Assume that he chooses B with probability
$q_j$ and A with probability $1 - q_j$.

When a decision is reached, the subject acts by making one of the two
responses, and his behavior at this stage of the trial can be observed and
recorded by the experimenter. For instance, the experimenter may record
the result of the subject’s choice on trial j as $x_j$, where $x_j$ is 1 if the subject
makes response A, zero if he makes response B.

With every sequence of trials 1, 2, ..., j, ..., two other sequences
describing the behavior of the subject over these trials may be connected:
the sequence of probabilities $q_1, q_2, \ldots, q_j, \ldots$ of making response B,
and the sequence of outcomes $x_1, x_2, \ldots, x_j, \ldots$, where $x_j = 1$ or 0.


Assume that the change in the probability of a subject’s making response
B from one trial to the next will in general depend on the experience of the
subject between the two choices; in other words, assume that, for any j, the
probability $q_{j+1}$ depends on $x_j$ as well as on $q_j$. In particular, given the value
of $q_j$ and the outcome of trial j, the probability of response B on trial
j + 1 is

\[ q_{j+1} = \begin{cases} w_{j+1} q_j & \text{if the subject was not rewarded on trial } j, \\ \theta w_{j+1} q_j & \text{if the subject was rewarded on trial } j, \end{cases} \]

where $w_{j+1}$ is a constant depending upon the combination of factors other
than the presence or absence of reward which influence the change from
$q_j$ to $q_{j+1}$, and where $\theta$ is a constant such that $0 < \theta \le 1$. Since the outcome
of response A is always reward while the outcome of response B is
non-reward, this means that

\[ q_{j+1} = \begin{cases} w_{j+1} q_j & \text{if the subject chose B on trial } j, \\ \theta w_{j+1} q_j & \text{if the subject chose A on trial } j. \end{cases} \]

Using the definition of $x_j$, this relation can be written more conveniently
as

\[ q_{j+1} = \theta^{x_j} w_{j+1} q_j . \tag{1} \]


Unless $\theta = 1$, the probability $q_{j+1}$ depends on whether or not the subject
was rewarded on trial j. In fact, $\theta < 1$ implies that after being rewarded the
subject has a smaller chance of making response B than after a non-reward
trial. This finding would support the assertion sometimes made that reward
has a stronger positive effect on learning. On the other hand, $\theta = 1$ implies
that the effects of reward and non-reward on the behavior of the subject
are identical.

Iterating relation (1),

\[ q_2 = \theta^{x_1} w_2 q_1 , \]
\[ q_3 = \theta^{x_2} w_3 q_2 = \theta^{x_1 + x_2} w_2 w_3 q_1 , \]
\[ q_4 = \theta^{x_1 + x_2 + x_3} w_2 w_3 w_4 q_1 , \]

and in general,

\[ q_j = \theta^{x_1 + \cdots + x_{j-1}} w_2 \cdots w_j q_1 \qquad (j = 2, 3, \ldots). \]

$u_1$ may be written for $q_1$, then $u_2$ for $w_2 q_1$, etc., since each of the factors is
an unknown parameter. In general,

\[ u_j = w_2 \cdots w_j q_1 . \]

Now

\[ q_j = \theta^{\sum_{k=0}^{j-1} x_k} u_j \qquad (j = 1, 2, \ldots), \]

where $x_0$ is defined as zero, and $u_j$ is a constant such that $0 < u_j < 1$. The
$u_j$ must remain in the unit interval because $q_j$ is a probability; the $q_j$ are
taken to be strictly positive to avoid having all probabilities $q_{j+k}$ after
trial j equal to zero no matter what the subject learns.
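
The agreement between the trial-to-trial relation (1) and the general expression for $q_j$ can be checked numerically. The sketch below is illustrative only: the function names and the parameter values are assumptions, with `w` holding $w_2, \ldots, w_m$ and `x` holding $x_1, \ldots, x_{m-1}$.

```python
def q_by_iteration(theta, w, q1, x):
    """Apply relation (1) repeatedly: q_{j+1} = theta**x_j * w_{j+1} * q_j."""
    q = [q1]
    for x_j, w_next in zip(x, w):
        q.append(theta ** x_j * w_next * q[-1])
    return q

def q_closed_form(theta, w, q1, x):
    """Use q_j = theta**(x_1 + ... + x_{j-1}) * u_j with u_j = w_2 ... w_j q_1."""
    q = []
    u_j = q1          # u_1 = q_1
    rewards = 0       # number of rewarded trials before trial j
    for j in range(len(w) + 1):
        q.append(theta ** rewards * u_j)
        if j < len(w):
            rewards += x[j]
            u_j *= w[j]
    return q

# Hypothetical parameter values for a four-trial sequence.
ITERATED = q_by_iteration(0.8, [0.9, 0.95, 0.85], 0.7, [1, 0, 1])
CLOSED = q_closed_form(0.8, [0.9, 0.95, 0.85], 0.7, [1, 0, 1])
```

Both lists hold $q_1, \ldots, q_4$ and agree up to floating-point rounding.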
Since the exponent of $\theta$ is the number of rewarded trials among the
first j − 1, $\theta < 1$ again implies that the occurrence of reward before trial j
decreases the probability of response B on that trial. On the other hand,
$\theta = 1$ implies that on trial j, $q_j$ has a fixed value $u_j$, no matter how many
previous trials were rewarded. In the remaining part of this section, a set of
random variables will be defined with distribution depending on $\theta$. The
question of whether or not reward has a greater effect than non-reward on
learning will then reduce to the problem of testing the statistical hypothesis
that $\theta = 1$ against the alternative that $\theta < 1$.

Suppose that the experiment consists of m consecutive trials for each
subject and that it is run independently on r different subjects. Define a set
of rm random variables $X_{ij}$ (i = 1, ..., r; j = 1, ..., m) by

\[ X_{ij} = \begin{cases} 1 & \text{if subject } i \text{ makes response A on trial } j, \\ 0 & \text{if subject } i \text{ makes response B on trial } j. \end{cases} \]

In the same way as above, let

\[ q_{ij} = P[\text{subject } i \text{ makes response B on trial } j \mid \text{outcome of previous trials}], \]

where $P[E_1 \mid E_2]$ denotes the conditional probability of event $E_1$, given $E_2$.
Then

\[ q_{ij} = P[X_{ij} = 0 \mid x_{i1}, \ldots, x_{i,j-1}]. \]


Define a set $\Omega$ of admissible hypotheses on the distribution of the random
variables with the following assumptions.

(i) All subjects have the same initial positive probability on the first
trial, that is, $q_{i1} = u_1 > 0$ (i = 1, ..., r).

(ii) Subjects act and learn independently of each other, that is, the
$X_{ij}$ are completely independent for different i.

(iii) Assume

\[ q_{ij} = \theta^{\sum_{k=0}^{j-1} x_{ik}} u_j \qquad (i = 1, \ldots, r;\ j = 1, \ldots, m), \tag{2} \]

where $0 < \theta \le 1$, $0 < u_j < 1$, and $x_{i0} = 0$.

The third assumption states that for subject i the probability that
$X_{ij} = 0$ depends on the sum $\sum_{k=1}^{j-1} x_{ik}$, which is the total number of A
responses made by that subject before trial j. It also depends on $\theta$ and, for
each j, on a different parameter $u_j$. However, the $u_j$ are the same for all
subjects.
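
To make assumptions (i) to (iii) concrete, the following sketch generates data from relation (2). It is a hypothetical illustration, not part of the original analysis: the function names, the seed, and any parameter values passed in are assumptions.

```python
import random

def simulate_subject(theta, u, rng):
    """Generate x_1, ..., x_m for one subject under relation (2)."""
    x = []
    rewards = 0                            # A responses made before trial j
    for u_j in u:
        q_j = theta ** rewards * u_j       # probability of response B
        x_j = 0 if rng.random() < q_j else 1
        x.append(x_j)
        rewards += x_j
    return x

def simulate_experiment(theta, u, r, seed=0):
    """r independent subjects sharing the same u_1, ..., u_m."""
    rng = random.Random(seed)
    return [simulate_subject(theta, u, rng) for _ in range(r)]

# Hypothetical run: 5 subjects, 3 trials.
DATA = simulate_experiment(0.9, [0.8, 0.7, 0.6], r=5, seed=1)
```

Each row of `DATA` is one subject's sequence of outcomes, with 1 for response A and 0 for response B.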


Relation to the Bush-Mosteller Model 


A direct application to this problem of the basic model introduced by
Bush and Mosteller ([1], ch. 1) results in an expression for $q_{ij}$ identical with
(2) in form but involving only three instead of m + 1 unknown parameters.
Bush and Mosteller assume that, for the subject i and any pair of probabilities
$q_{ij}$ and $q_{i,j+1}$, the transition in probability from $q_{ij}$ to $q_{i,j+1}$ can be
expressed in terms of the outcome of the trial j and some constant parameters
of the learning situation that do not depend on j. In particular, their
“fixed-point form” ([1], p. 29) assumes that

\[ q_{i,j+1} = \alpha q_{ij} + (1 - \alpha)\lambda, \]

where $\alpha$ and $\lambda$ are constant parameters of the event occurring on the trial j.
In this learning situation, where response A is always rewarded and response
B never rewarded, the “fixed-point” parameter ([1], p. 60) is zero, and the
basic assumption reduces to the expression

\[ q_{i,j+1} = \begin{cases} b q_{ij} & \text{if subject } i \text{ chose B on trial } j, \\ \theta b q_{ij} & \text{if subject } i \text{ chose A on trial } j, \end{cases} \]

for any j = 2, ..., m, where $0 < \theta \le 1$, $0 < b < 1$, so that $\theta b$ is their reward
and $b$ their non-reward parameter.

Using the random variable $X_{ij}$ already introduced, this probability for
the subject i is

\[ q_{i,j+1} = \theta^{x_{ij}} b q_{ij} \qquad (j = 2, \ldots, m). \]

This relation is identical with (1) in form, with the constant $b$ replacing
$w_{j+1}$ for all j. Thus, it is a special case of (1) where the trial-to-trial transitions
in the probability $q_j$ are more restricted than in our general model.

This iterative relation can again be used to deduce a general expression
for $q_{ij}$ as in the above section. In this case,

\[ q_{i2} = \theta^{x_{i1}} b q_{i1} , \]
\[ q_{i3} = \theta^{x_{i2}} b q_{i2} = \theta^{x_{i1} + x_{i2}} b^2 q_{i1} , \]

and, in general,

\[ q_{ij} = \theta^{x_{i1} + \cdots + x_{i,j-1}} b^{j-1} q_{i1} . \]

Assuming that $q_{i1} = q > 0$ (i = 1, ..., r), which is equivalent to assumption
(i) in the model,

\[ q_{ij} = \theta^{\sum_{k=0}^{j-1} x_{ik}} b^{j-1} q . \tag{3} \]

Equation (3) is the same as (2), with the additional restriction that
each $u_j = b^{j-1} q$. The parameters $u_1, \ldots, u_m$ are therefore replaced in this
special case by a decreasing geometric sequence $q, bq, \ldots, b^{m-1} q$. It is possible
to impose a restriction on the parameters $u_j$ in the model by assuming that
$u_1 > u_2 > \cdots > u_m$, which would bring it closer to the Bush-Mosteller
version. It can be shown ([2], p. 19) that the test proposed in the next section
would still hold in this partially restricted case. However, the results given
below were obtained without an ordering restriction being imposed upon
the parameters.
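
The restriction $u_j = b^{j-1} q$ can be written out in a few lines. This is a minimal sketch with a hypothetical function name and hypothetical values of q and b; it shows only how the m parameters of the general model collapse to a geometric sequence.

```python
def bush_mosteller_u(q, b, m):
    """Return u_1, ..., u_m under the restriction u_j = b**(j - 1) * q."""
    return [b ** (j - 1) * q for j in range(1, m + 1)]

# Hypothetical values: q = .8, b = .5, three trials.
U_SPECIAL = bush_mosteller_u(0.8, 0.5, 3)
```

For 0 < b < 1 the resulting sequence is strictly decreasing, so it also satisfies the ordering $u_1 > u_2 > \cdots > u_m$.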


Testing the Hypothesis that $\theta = 1$


Theoretical Considerations 


In order to state the testing problem more precisely, the distribution of
the random variables under the set $\Omega$ of admissible hypotheses is required.
The general form for this distribution can be obtained from (2), which gives
the conditional probability $P[X_{ij} = 0 \mid x_{i1}, \ldots, x_{i,j-1}]$ for any one of the
random variables $X_{ij}$ as

\[ P[X_{ij} = 0 \mid x_{i1}, \ldots, x_{i,j-1}] = \theta^{\sum_{k=0}^{j-1} x_{ik}} u_j \qquad (i = 1, \ldots, r;\ j = 1, \ldots, m). \]

Starting with this relation, one can verify that the joint probability
distribution of the rm random variables is the product

\[ P[X_{ij} = x_{ij};\ \text{all } i \text{ and } j \mid \Omega] = \prod_{i=1}^{r} \prod_{j=1}^{m} \left(1 - \theta^{\sum_{k=0}^{j-1} x_{ik}} u_j\right)^{x_{ij}} \left(\theta^{\sum_{k=0}^{j-1} x_{ik}} u_j\right)^{1 - x_{ij}} , \tag{4} \]

where $0 < \theta \le 1$ and $0 < u_j < 1$ (j = 1, ..., m).

The hypothesis to be tested is the assertion that reward and non-reward
have identical effects on the behavior of the subject. In terms of the
parameters of the distribution, this is the hypothesis,

\[ H: \theta = 1, \]

so that in this case the joint probability distribution (4) reduces to

\[ P[X_{ij} = x_{ij};\ \text{all } i \text{ and } j \mid H] = \prod_{j=1}^{m} (1 - u_j)^{\sum_{i=1}^{r} x_{ij}}\, u_j^{\,r - \sum_{i=1}^{r} x_{ij}} , \tag{5} \]

where $0 < u_j < 1$ (j = 1, ..., m).
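
The reduction of the joint distribution from its general form to the form under H can be verified numerically on the log scale. The sketch below is illustrative; the function names, the small data matrix, and the parameter values are assumptions, and the code follows the reading of relation (2) in which the probability of response B on trial j is $\theta$ raised to the number of earlier A responses, times $u_j$.

```python
import math

def log_likelihood(x, theta, u):
    """Log of the joint probability of the 0/1 matrix x under relation (2)."""
    total = 0.0
    for row in x:
        rewards = 0
        for x_ij, u_j in zip(row, u):
            q_ij = theta ** rewards * u_j          # P[X_ij = 0 | past]
            total += math.log(1.0 - q_ij) if x_ij == 1 else math.log(q_ij)
            rewards += x_ij
    return total

def log_likelihood_under_h(x, u):
    """Same quantity when theta = 1: depends on x only through column totals."""
    r = len(x)
    total = 0.0
    for j, u_j in enumerate(u):
        c_j = sum(row[j] for row in x)
        total += c_j * math.log(1.0 - u_j) + (r - c_j) * math.log(u_j)
    return total

# Hypothetical data for two subjects and three trials.
X_SMALL = [[1, 0, 1], [0, 0, 1]]
U_SMALL = [0.6, 0.5, 0.4]
```

Evaluating `log_likelihood(X_SMALL, 1.0, U_SMALL)` and `log_likelihood_under_h(X_SMALL, U_SMALL)` gives the same value, which is the content of the reduction.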

Notice that the hypothesis tested, H, is a composite hypothesis, since
it does not specify the values of the parameters $u_1, \ldots, u_m$. The problem
is to find a test of this composite hypothesis against the set of alternatives
that $0 < \theta < 1$ at a fixed level of significance $\alpha$.

Since the distribution of the random variables $X_{ij}$ which is given in
(4) and (5) is discrete, it will in general be necessary to use a randomized
test in order to make the probability of rejecting the hypothesis when true
equal exactly $\alpha$. With a randomized test, the points of the sample space
under consideration are divided into three mutually exclusive sets: $\omega_1$, $\omega_2$,
and $\omega_3$. Each point represents a possible matrix of observed values $x_{ij}$ and
will be referred to briefly as a sample point x. The hypothesis tested is rejected
whenever the observed sample point x falls within $\omega_1$ and is accepted whenever
x falls in $\omega_2$. When x falls in $\omega_3$, the hypothesis is either rejected or accepted,
the decision to reject or accept in such a case being determined, with
probability specified by the test, by some random process completely independent
of the experiment producing the observation x, such as the use of a table of
random numbers. In order to be more precise, the Lehmann-Stein notation
[3] is used. Define any test T in terms of its critical function $\psi(x)$, with
$0 \le \psi(x) \le 1$, where $\psi(x)$ is the probability with which test T rejects the
hypothesis H when the point x is observed. Thus $\omega_1$ is the set where $\psi(x)$ is 1,
while $\omega_2$ is the set where $\psi(x) = 0$, and $\omega_3$ is the set where $0 < \psi(x) < 1$.
The value of $\psi(x)$ is determined so that $E[\psi(X) \mid H] = \alpha$. Here, $\alpha$ is the
probability with which the test T rejects the hypothesis H; it is also known
as the size of T.
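
The three-way division of the sample space can be sketched in code for a test based on a single statistic. Everything named here is hypothetical: the statistic `y`, the cut-off `k`, and the randomization probability `gamma` stand in for whatever a particular test specifies.

```python
import random

def critical_function(y, k, gamma):
    """Probability of rejecting H for an observed statistic y.

    Returns 1 on the rejection set, 0 on the acceptance set, and the
    randomization probability gamma on the boundary y == k.
    """
    if y > k:
        return 1.0
    if y < k:
        return 0.0
    return gamma

def decide(y, k, gamma, rng=random):
    """Reject (True) or accept (False), using an independent random draw."""
    return rng.random() < critical_function(y, k, gamma)
```

The random draw matters only on the boundary; elsewhere the decision is determined by the data alone.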

In addition to the size of the test, it is also important to consider its
power. In particular, following the theory of hypothesis testing developed
by Neyman and Pearson [4], it is desirable that for the composite hypothesis
H there be a test T with the following two properties.

(i) It is a similar test of size $\alpha$. In other words, the probability of
rejecting the hypothesis H when $\theta = 1$ is equal to $\alpha$ whatever be the values of
$u_1, \ldots, u_m$ in $\Omega$. This property can be stated symbolically as

\[ E[\psi(X) \mid H] = \alpha \quad \text{for all possible values of } u_j . \]

(ii) The similar test is most powerful in the sense that, when $\theta < 1$, the
probability of rejecting the hypothesis that $\theta = 1$ is the largest possible
among all similar tests of the same size. If H were tested against a simple
alternative

\[ K: \theta = \theta',\ u_j = u_j' , \]

where $0 < \theta' < 1$ and with $u_j'$ a specified set of values of $u_j$, this property
could be stated symbolically as

\[ E[\psi(X) \mid K] \ge E[\psi_1(X) \mid K], \]

where $\psi_1(X)$ is the critical function of any other test having property (i).

Now, if in addition to being similar, T possesses property (ii)
independently of the specified values $\theta', u_1', \ldots, u_m'$, then T would be the most
powerful similar test against all alternatives in $\Omega$. Such a test is known as a
uniformly most powerful similar test. A uniformly most powerful similar
test would be desirable in this problem since the values of $\theta$ and $u_1, \ldots, u_m$
are not specified in the alternative against which the test is made. If, on the
other hand, the similar test of H which is most powerful against a simple
alternative in $\Omega$ depends on the value of any of the parameters specified by
the alternative, then the test is not most powerful against all the alternatives
in $\Omega$, and there exists no uniformly most powerful similar test of H.

In the event that no uniformly most powerful similar test exists, one
might be satisfied with a similar test that is locally best, in the sense that it
will be most powerful in detecting small deviations from unity in $\theta$. Thus,
while still requiring that the test define a rejection rule based on the
observation x which is independent of the values of $u_1, \ldots, u_m$, one might
consider only “local” alternatives with $\theta$ in the neighborhood of the value tested,
$\theta = 1$.

Unfortunately, however, it has been found that neither a similar test
that is uniformly most powerful against all alternatives in $\Omega$ nor one that is
locally best exists ([2], secs. 3.1, 3.2), as such tests appear to depend on the
values of the unknown parameters.

An alternative approach is possible by means of a method for constructing
asymptotic tests for composite hypotheses first given by Neyman [5]
in the case of a single unknown parameter. The method utilizes an estimate
of the unknown parameter and, allowing the sample size r to increase
indefinitely, defines a sequence of critical functions $\psi_r(X)$ such that, as $r \to \infty$,
the expected value of $\psi_r(X)$ under the hypothesis tested tends to the
preassigned level $\alpha$, independently of the unspecified parameter. A test defined
by such a sequence of critical functions is called asymptotically similar.

By an extension of Neyman’s method to the case where the distribution
of the random variables under the hypothesis tested involves more than one
unknown parameter, it is possible to construct an asymptotically similar
test of the composite hypothesis H: $\theta = 1$. Such a test is given in the next
section and will be referred to as test $T_1$.


An Asymptotically Locally Best Similar Test of H


Following Neyman, the sequence of critical functions $\psi_r(x)$ will be
derived from a set of r independently distributed functions of the observations,
say $\zeta_1, \ldots, \zeta_r$, through a number of simple transformations and with the
use of a set of consistent estimates for the unspecified parameters $u_1, \ldots, u_m$.

The functions $\zeta_1, \ldots, \zeta_r$ must satisfy certain general conditions of
finiteness and differentiability ([2], ch. 3.3), which are easily met. However,
of all the possible sets of functions $\zeta_1, \ldots, \zeta_r$ satisfying those conditions,
one selects a suitable set such that the power function of the resulting
asymptotically similar test would have some optimum local property. For this
purpose, first consider a modified hypothesis in $\Omega$; in particular, the
hypothesis

\[ H_0: \theta = 1;\ u_j = u_j^0 , \]

where $u_j^0$ is a specified value of $u_j$ (j = 1, ..., m). Now look for the locally best
size $\alpha$ test, say $T_0$, of the hypothesis $H_0$ against the alternatives














MARY I. HANANIA 61 


\[ K: \theta < 1;\ u_1^0, u_2^0, \ldots, u_m^0 . \]

That is, in the class of all size $\alpha$ tests of $H_0$ against K, the power function of
the test $T_0$ must have numerically the largest slope at $\theta = 1$.
Such a test can be easily obtained from the probability distribution
(4) by application of the Neyman-Pearson Fundamental Lemma [4], and is
given by

\[ \psi(x) = 1 \text{ whenever } L > k, \]
\[ \psi(x) = 0 \text{ whenever } L < k, \]
\[ 0 < \psi(x) < 1 \text{ whenever } L = k, \]

where

\[ L = \sum_{i=1}^{r} \sum_{j=2}^{m} \frac{x_{ij} - (1 - u_j^0)}{1 - u_j^0} \sum_{k=1}^{j-1} x_{ik} , \tag{6} \]

and k is so chosen that $E[\psi(X) \mid H_0] = \alpha$.
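
The dependence of L on the data can be traced by differentiating the logarithm of (4) with respect to $\theta$ at $\theta = 1$. The lines below are a sketch of that step only; the shorthand $s_{ij}$ for the number of A responses of subject i before trial j is introduced here for convenience, and the sign is chosen so that large values point toward $\theta < 1$.

```latex
\begin{aligned}
s_{ij} &= \sum_{k=1}^{j-1} x_{ik}, \\
\log P &= \sum_{i=1}^{r}\sum_{j=1}^{m}
  \Bigl[ x_{ij}\log\bigl(1-\theta^{s_{ij}}u_j\bigr)
       + (1-x_{ij})\bigl(s_{ij}\log\theta + \log u_j\bigr) \Bigr], \\
-\left.\frac{\partial \log P}{\partial \theta}\right|_{\theta=1}
  &= \sum_{i=1}^{r}\sum_{j=1}^{m} s_{ij}
     \Bigl[ \frac{x_{ij}\,u_j}{1-u_j} - (1-x_{ij}) \Bigr]
   = \sum_{i=1}^{r}\sum_{j=2}^{m} \frac{x_{ij}-(1-u_j)}{1-u_j}\, s_{ij}.
\end{aligned}
```

The term for j = 1 drops out because $s_{i1} = 0$, which is why the sum over trials starts at j = 2.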
In close relation to L, the set of r functions $\zeta_i$ now is defined as

\[ \zeta_i = \sum_{j=2}^{m} \frac{x_{ij} - v_j}{v_j} \sum_{k=1}^{j-1} x_{ik} \qquad (i = 1, \ldots, r), \tag{7} \]

where $v_j = 1 - u_j$ for all j. For each i, the function $\zeta_i$ is seen to depend on
the unknown parameters $u_1, \ldots, u_m$ as well as on the set of observations
$x_{i1}, \ldots, x_{im}$, which may be considered as a single observation on the random
vector $X_i = (X_{i1}, \ldots, X_{im})$. Under the hypothesis tested H: $\theta = 1$, the
joint distribution of the r random vectors $X_1, \ldots, X_r$ is given by (5), from
which it is clear that the vectors are independently and identically
distributed with the probability distribution

\[ P[X_i = x_i \mid u_1, \ldots, u_m;\ H] = \prod_{j=1}^{m} v_j^{x_{ij}} u_j^{1 - x_{ij}} \qquad (i = 1, \ldots, r). \tag{8} \]

It follows that the functions $\zeta_1, \ldots, \zeta_r$ as defined in (7) must also be mutually
independent.

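A critical function $\psi_r(x)$ now is derived from the r functions $\zeta_1, \ldots, \zeta_r$
through the following transformations.

(i) For each i, let the standardized value of $\zeta_i$ be

\[ Z_i = \frac{\zeta_i - E(\zeta_i)}{\sigma(\zeta_i)} . \]

As can be verified, both the expected value $E(\zeta_i)$ and the variance $\sigma^2(\zeta_i)$
of the function $\zeta_i$ are finite, and the resulting standardized function is

\[ Z_i = \frac{\displaystyle\sum_{j=2}^{m} \left( \frac{x_{ij} - v_j}{v_j} \sum_{k=1}^{j-1} x_{ik} \right)}{\left[ \displaystyle\sum_{j=2}^{m} \frac{u_j}{v_j} \left\{ \sum_{k=1}^{j-1} u_k v_k + \Bigl( \sum_{k=1}^{j-1} v_k \Bigr)^{2} \right\} \right]^{1/2}} . \tag{9} \]

(ii) Now introduce the m functions

\[ \phi_j = \frac{\partial \log P[X_i = x_i \mid u_1, \ldots, u_m;\ H]}{\partial u_j} \]

for j = 1, ..., m, and let

\[ \Phi_j = \frac{\phi_j - E(\phi_j)}{\sigma(\phi_j)} \]

be the standardized value of $\phi_j$. Then each $\Phi_j$ (j = 1, ..., m) is a function
of $X_i$ and $u_1, \ldots, u_m$.

(iii) Also define m other functions $R_1, \ldots, R_m$ by

\[ R_j = E(Z_i \Phi_j). \]

(iv) Finally, for each i, combine $Z_i$, $\Phi_1, \ldots, \Phi_m$, and $R_1, \ldots, R_m$ into
the function

\[ \eta_i = \frac{Z_i - \sum_{j=1}^{m} R_j \Phi_j}{\sqrt{1 - \sum_{j=1}^{m} R_j^2}} \tag{10} \]

and form the normalized sum

\[ Y_r = \frac{1}{\sqrt{r}} \sum_{i=1}^{r} \eta_i(X_i;\ u_1, \ldots, u_m). \]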
In terms of the original observations,

\[ Y_r = \frac{\displaystyle\frac{1}{\sqrt{r}} \sum_{i=1}^{r} \sum_{j=2}^{m} \frac{x_{ij} - v_j}{v_j} \sum_{k=1}^{j-1} (x_{ik} - v_k)}{\left[ \displaystyle\sum_{j=2}^{m} \frac{u_j}{v_j} \sum_{k=1}^{j-1} u_k v_k \right]^{1/2}} . \tag{11} \]

Equation (11) can be derived from (9) and (10) and the definitions of the
functions introduced above.
$Y_r$ is clearly a function of the unknown parameters, as well as of the rm
observations $x_{ij}$ (i = 1, ..., r; j = 1, ..., m). Replacing each $u_j$ in $Y_r$ by
an estimate $u_j^*$ for j = 1, ..., m gives the corresponding function

\[ Y^* = \frac{1}{\sqrt{r}} \sum_{i=1}^{r} \eta_i(X_i;\ u_1^*, \ldots, u_m^*), \]

whose definition is independent of $u_1, \ldots, u_m$. A suitable set of estimates
can be found by the maximum likelihood method, which gives

\[ \frac{c_j}{r} = \frac{1}{r} \sum_{i=1}^{r} x_{ij} \]

as the maximum likelihood estimate of $1 - u_j$.













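Therefore take

\[ 1 - u_j^* = v_j^* = \frac{c_j}{r} \quad \text{whenever } c_j \ne 0, \]
\[ v_j^* = \frac{1}{r^2} \quad \text{whenever } c_j = 0. \]

The provision in the latter case is introduced in order to avoid division by
zero in $Y^*$.

Substituting these estimates for $u_1, \ldots, u_m$ (or $v_1, \ldots, v_m$) in (11) and
simplifying,

\[ Y^* = \frac{\displaystyle\sum_{i=1}^{r} \sum_{j=2}^{m} \frac{x_{ij}}{c_j} \sum_{k=1}^{j-1} x_{ik} - \frac{1}{r} \sum_{j=1}^{m-1} (m - j) c_j}{\left[ \displaystyle\frac{1}{r^3} \sum_{j=2}^{m} \frac{r - c_j}{c_j} \sum_{k=1}^{j-1} c_k (r - c_k) \right]^{1/2}} , \tag{12} \]

where $c_j = 0$ is replaced by $1/r$.

Under the hypothesis H the probability distribution of this function
$Y^*$ can be shown to approach the normal with zero mean and unit variance
as $r \to \infty$ ([2], ch. 3.3).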

Now the test $T_1$ is defined as follows. Compute $Y^*$ from the data; choose
a level of significance $\alpha$ and reject the hypothesis H with probability $\psi_r(x)$,
where

\[ \psi_r(x) = 1 \quad \text{if } Y^* > k_\alpha , \]
\[ \psi_r(x) = 0 \quad \text{if } Y^* < k_\alpha , \]
\[ 0 < \psi_r(x) < 1 \quad \text{if } Y^* = k_\alpha , \]

and $k_\alpha$ is defined by

\[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{k_\alpha} e^{-t^2/2}\, dt = 1 - \alpha . \]

This test has the following properties.

(i) By definition of $Y^*$ in (12), $E[\psi_r(X) \mid H]$ is independent of the
unspecified parameters $u_1, \ldots, u_m$.

(ii) Since the distribution of $Y^*$ approaches N(0, 1) under H as $r \to \infty$,
the size of the test $E[\psi_r(X) \mid H] \to \alpha$, the required significance level.

(iii) By comparison of the asymptotic distributions of L in (6) and
$Y^*$ in (12) the test $T_1$ may be shown to be asymptotically equivalent to
the locally best test $T_0$.

Therefore test $T_1$ is asymptotically a locally best test of H against the
alternatives K: $\theta < 1$; $u_1, \ldots, u_m$, and asymptotically similar of size $\alpha$.
Application of Test $T_1$

Given is a set of rm observations $x_{ij}$ (i = 1, ..., r; j = 1, ..., m), where
r is the number of subjects in an experiment of m trials, with

\[ x_{ij} = \begin{cases} 1 & \text{if subject } i \text{ is rewarded on trial } j, \\ 0 & \text{if subject } i \text{ is not rewarded on trial } j. \end{cases} \]

To test the hypothesis H: $\theta = 1$ (reward and non-reward are equally
effective) against the alternatives that $\theta < 1$ (reward is more effective) at
the level of significance $\alpha$, apply the following rule.

(i) From the data, form the sums

\[ c_j = \sum_{i=1}^{r} x_{ij} \qquad (j = 1, \ldots, m). \]

Then $c_j$ is the total number of subjects rewarded on the trial j. Let

\[ c_j' = \begin{cases} c_j & \text{whenever } c_j \ne 0, \\ 1/r & \text{whenever } c_j = 0. \end{cases} \]

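(ii) Compute

\[ M = \sum_{i=1}^{r} \sum_{j=2}^{m} \frac{x_{ij}}{c_j'} \sum_{k=1}^{j-1} x_{ik} , \]

\[ C' = \frac{1}{r} \sum_{j=1}^{m-1} (m - j)\, c_j' , \]

\[ C'' = \frac{1}{r^3} \sum_{j=2}^{m} \frac{r - c_j'}{c_j'} \sum_{k=1}^{j-1} c_k' (r - c_k') , \]

and combine these three quantities to obtain

\[ Y = \frac{M - C'}{\sqrt{C''}} . \tag{13} \]
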

(iii) From a table of the normal distribution function find the upper
$\alpha$ point, $k_\alpha$, defined by

\[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{k_\alpha} e^{-t^2/2}\, dt = 1 - \alpha . \tag{14} \]

Then reject H with probability $\psi(x)$, where

\[ \psi(x) = 1 \quad \text{if } Y > k_\alpha , \]
\[ \psi(x) = 0 \quad \text{if } Y < k_\alpha , \tag{15} \]
\[ 0 < \psi(x) < 1 \quad \text{if } Y = k_\alpha . \]

The probability of rejecting the hypothesis H when Y as computed from
the data is exactly equal to $k_\alpha$ should be chosen so as to make $E[\psi(X) \mid H] = \alpha$.
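
Steps (i) to (iii) can be sketched as a short program. This is an illustrative reading of the rule, not a definitive implementation: the function names and the two-subject example are hypothetical, and the factor $1/r^3$ in $C''$ is an assumption, being the normalization that follows from substituting $c_j/r$ for $1 - u_j$ in the variance term; it should be checked against a clean copy of equations (12) and (13).

```python
import math
from statistics import NormalDist

def statistic_y(x):
    """Compute Y = (M - C') / sqrt(C'') from an r-by-m matrix of 0/1 outcomes."""
    r, m = len(x), len(x[0])
    # Step (i): column totals, with a zero total replaced by 1/r.
    c = [sum(row[j] for row in x) for j in range(m)]
    c = [c_j if c_j != 0 else 1.0 / r for c_j in c]
    # Step (ii): M weights each reward by the subject's earlier rewards.
    big_m = 0.0
    for row in x:
        earlier = 0
        for j in range(m):
            big_m += row[j] * earlier / c[j]
            earlier += row[j]
    c_prime = sum((m - j) * c[j - 1] for j in range(1, m)) / r
    running, total = 0.0, 0.0
    for j in range(2, m + 1):
        running += c[j - 2] * (r - c[j - 2])      # sum over k < j of c_k (r - c_k)
        total += (r - c[j - 1]) / c[j - 1] * running
    c_double_prime = total / r ** 3               # assumed normalization
    # Fails with a division by zero if every subject is rewarded on every trial.
    return (big_m - c_prime) / math.sqrt(c_double_prime)

def upper_point(alpha):
    """Step (iii): the upper alpha point of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha)

# Hypothetical data: one subject always rewarded, one never, two trials.
Y_EXAMPLE = statistic_y([[1, 1], [0, 0]])
REJECT = Y_EXAMPLE > upper_point(0.05)
```

For this toy matrix M = 1, C' = .5, and C'' = 1/8, so Y equals the square root of 2, which falls below the 5 per cent point of about 1.645.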


Illustration 


The use of the test $T_1$ is illustrated on data from an experiment on
prediction by F. W. Irwin and L. Dyckman*, in which the subjects were 20
women undergraduates. At the start of each trial the subjects were asked
to predict whether or not a light would go on. The light then went on, so
that a subject who had predicted it would go on may be said to have been
“rewarded,” or agreed with. Since the light went on every time regardless
of the subject’s choice, this is a case of continuous reinforcement.

Table 1 shows the outcome of the experiment in the form of a matrix 


TABLE 1

Data of F. W. Irwin and L. Dyckman from a "Humphreys-type" Experiment
with Continuous Reinforcement*

[The individual 0/1 entries $x_{ij}$ for the 20 subjects (i) over the 22 trials (j)
are not legible in this copy; only the column totals are reproduced.]

Trial (j)       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
Total (c_j)    18  10   9   9   8  14  13  14  18  17  18  19  19  17  19  20  19  20  20  19  19  20

* An entry of 1 indicates $x_{ij} = 1$ (reward); a zero entry indicates $x_{ij} = 0$.



of observations for 20 subjects and 22 trials, with an entry of one for a correct 
prediction and zero for an incorrect prediction. In this experiment, r = 20 
and $m = 22$. For each $j$ ($j = 1, \ldots, 22$),

$$c_j = \sum_{i=1}^{20} x_{ij},$$

from which are obtained

$$\frac{1}{c_j}, \qquad \frac{20 - c_j}{c_j},$$

and

$$\sum_{k=1}^{j-1} c_k (20 - c_k) \qquad (j = 2, \ldots, 22).$$

Then compute

$$M = \sum_{i=1}^{20} \sum_{j=2}^{22} \frac{x_{ij}}{c_j} \sum_{k=1}^{j-1} x_{ik},$$

$$C' = \frac{1}{20} \sum_{j=1}^{21} (22 - j)\, c_j,$$

$$C'' = \left[ \frac{1}{20^3} \sum_{j=2}^{22} \frac{20 - c_j}{c_j} \sum_{k=1}^{j-1} c_k (20 - c_k) \right]^{1/2},$$

as shown in Table 2. Then, by (13),

$$Y = \frac{M - C'}{C''} = 4.186.$$

*Personal communication from F. W. Irwin.
TABLE 2

Table Illustrating the Computing of Y in the Irwin-Dyckman Data

Columns: (1) j; (2) 22 - j; (3) c_j; (4) 20 - c_j; (5) (20 - c_j)/c_j;
(6) sum_{k=1}^{j-1} c_k (20 - c_k); (7) sum_i x_ij sum_{k=1}^{j-1} x_ik;
(8) (1/c_j) sum_i x_ij sum_{k=1}^{j-1} x_ik.

 (1)   (2)   (3)   (4)    (5)      (6)    (7)     (8)
  1    21    18     2    .1111
  2    20    10    10   1.0000     36      9     .9000
  3    19     9    11   1.2222    136     14    1.5556
  4    18     9    11   1.2222    235     17    1.8889
  5    17     8    12   1.5000    331     19    2.3750
  6    16    14     6    .4286    415     37    2.6429
  7    15    13     7    .5385    506     47    3.6154
  8    14    14     6    .4286    590     64    4.5714
  9    13    18     2    .1111    626     87    4.8333
 10    12    17     3    .1765    677    100    5.8824
 11    11    18     2    .1111    713    116    6.4444
 12    10    19     1    .0526    732    143    7.5263
 13     9    19     1    .0526    751    162    8.5263
 14     8    17     3    .1765    802    160    9.4118
 15     7    19     1    .0526    821    198   10.4211
 16     6    20     0    0        821    222   11.1000
 17     5    19     1    .0526    840    236   12.4211
 18     4    20     0    0        840    261   13.0500
 19     3    20     0    0        840    281   14.0500
 20     2    19     1    .0526    859    289   15.2105
 21     1    19     1    .0526    878    302   15.8947
 22     0    20     0    0        878    339   16.9500

                                               M = 169.2711

C' = 3340/20 = 167.0     C'' = (2355.7072/8000)^{1/2} = 0.5426     Y = 2.2711/0.5426 = 4.1856


To test the hypothesis that $\theta = 1$ (reward and non-reward, or agreement
and contradiction, are equally effective) at the level of significance .05, the
rule as given by (15) is to reject the hypothesis tested if $Y > k_{.05} = 1.645$.
Within the framework of the model, the result of the computations in Table
2 would therefore lead to rejection of the hypothesis that $\theta = 1$, which
implies acceptance of the alternative that reward (making a correct
prediction) had a stronger effect on learning in this experiment.
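
The decision step can be checked numerically. The short Python sketch below is an illustration, not part of the original computation: it takes the three quantities M, C' and C'' as reported in Table 2, forms Y by (13), and obtains the upper .05 point of the standard normal distribution from the standard library instead of a printed table. The variable names are assumptions made here.

```python
from statistics import NormalDist

alpha = 0.05
# Upper alpha point of the standard normal distribution, as in (14).
k_alpha = NormalDist().inv_cdf(1 - alpha)

# Values reported in Table 2 (C1 and C2 stand for C' and C'').
M, C1, C2 = 169.2711, 167.0, 0.5426
Y = (M - C1) / C2

# Rule (15): reject H when Y exceeds k_alpha.
reject = Y > k_alpha
print(round(k_alpha, 3), round(Y, 4), reject)
```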

Since the above example is given here primarily for purposes of illus- 
trating the computations, it is not necessary to proceed further into the 
interpretations of the statistical test and resulting conclusions. It need hardly 
be emphasized that the validity of the statistical model must be carefully 
considered in each individual experiment before attempting to draw conclu- 
sions from the application of the test. 


Testing Against Other Alternatives 


Up to now, the question under consideration has been whether the
effect of reward is equal to or greater than that of non-reward. Thus the set $\Omega$
of admissible hypotheses allows $\theta$ to assume values either equal to or less
than unity, but never greater. However, in some experiments there is a
question whether non-reward, or perhaps punishment, has a stronger effect
than reward. To allow for this possibility, assume a larger set of admissible
hypotheses, say

$$\Omega_1 : 0 < \theta < \theta_1,$$

where $\theta_1 > 1$ is defined so that the probability

$$q_{ij} = \theta u_j$$

remains less than or equal to one for all $j$. This is achieved by letting
$\theta_1 = 1/\max u_j$. In this new set $\Omega_1$, the results of a preceding section may be
immediately applied to the case of testing the hypothesis

$$H_0 : \theta = 1$$

against the alternatives $1 < \theta < \theta_1$. Notice that this situation is a mirror
of the situation originally discussed above, namely the testing of $H$: $\theta = 1$
against alternatives in which $0 < \theta < 1$. It will be natural to expect,
therefore, that the test of $H_0$ be the exact mirror of the test $T_1$ of $H$, given in (15).
In fact, reversing the inequalities of (15), a test of $H_0$ against alternatives
with $\theta > 1$ is obtained. By considerations of symmetry this test of $H_0$ can
be shown to have the same optimum asymptotic properties as does the test
$T_1$ of $H$ against alternatives with $\theta < 1$. This test of $H_0$, to be referred to as
$T_2$, is given in terms of its critical function $\psi(x)$ as

$$\psi(x) \begin{cases} = 1 & \text{if } Y < k_\alpha, \\ = 0 & \text{if } Y > k_\alpha, \\ \in [0, 1] & \text{if } Y = k_\alpha, \end{cases} \tag{16}$$

where $Y$ and $k_\alpha$ are defined as in (13) and (14), respectively.
A Note on the Use of the Asymptotic Tests in Small Samples


By definition, the asymptotic test has the property that its size, or the
probability of its Type I error (rejecting the hypothesis tested when true),
approaches the preassigned value $\alpha$ as $r \to \infty$. In practice, therefore, one has
at best only an approximate test whose size gets closer and closer to $\alpha$ as the
number $r$ of subjects in the experiment increases. The question then arises
as to how large the number of subjects must be in order that the probability
of Type I error of the test be within some reasonably small interval around $\alpha$.

A partial answer to this question can be found in a study of the behavior
of the asymptotic test $T_1$ in some selected cases in which the sample size $r$
is small ([2], sec 5). To some extent the approximation seems to depend on
the sizes of the statistics $c_j$ in the data, where $c_j = \sum_{i=1}^{r} x_{ij}$ ($j = 1, \ldots, m$),
so that $0 \le c_j \le r$. For the case $r = 10$, $m = 4$, the asymptotic test gave
fairly good approximations when none of the $c_j$ ($j = 1, \ldots, 4$) had an extreme
value like 0, 9 or 10. For $\alpha = .05$, the actual probability for Type I error in
such cases appeared to lie between .01 and .08, being very close to .05 in
most examples. The approximations became progressively worse when one or
more $c_j$ ($j = 1, \ldots, 4$) were given values near zero or ten.

As expected, the asymptotic test gives a better approximation to the 
desired size of the test as the number of subjects r increases. There are also 
indications of improvements in the approximation as a result of an increase 
in the length m of the experiment. 
SOME IMPLICATIONS OF
THE LOGICAL CALCULUS FOR EMPIRICAL CLASSES 
FOR SOCIAL SCIENCE METHODOLOGY* 
An exposition of a calculus for empirical classes (CEC), one of the few
attempts by logicians to deal with the problem of constructs and indicators, 
is presented. The CEC provides the groundwork for a formal structure for 
the situation in which individuals have a degree of membership in various 
classes rather than having either membership or nonmembership—a situation 
nearly always true in empirical research. The CEC is presented and its re- 
lation to various social science concepts is mentioned. An application of the 
CEC model to latent structure analysis (LSA) suggests alternatives to the 
local independence assumption including one called the local scale assumption, 
which has a close relation to a Guttman scale. 


Following is an attempt to interpret for social scientists an article by 
Kaplan and Schott [3] on a calculus for empirical classes, and to suggest 
some relations of this article to social science research. The article was written 
by two philosophers. It was intended as a follow-up article to an earlier article 
by Kaplan [2]. 

The Kaplan and Schott work is a generalization of what is known to 
logicians as the calculus of classes or sets. A class may be defined as the 
totality of objects having a certain property. An object is spoken of as 
belonging to, or being a member of a given class as “‘Lazarus belongs to 
the Socialist Party” or “Mendel is an Elk.” Perhaps the reader, especially 
if he is a social scientist, has felt the inadequacy of the conception of class 
membership to his own problems, an inadequacy which rests chiefly in the 
fact that these logical operations require, for their applicability, a certitude 
as to the class membership or nonmembership of empirical cases, a condition 
seldom achieved in actual research work. Only in trivial cases can it be 
said with certainty that ‘‘A is a schizophrenic” or “B is upward mobile.” 

Kaplan and Schott (hereafter referred to as K-S) set out to construct 
a logical system which is precisely analogous to the conventional calculus 
(which they call the calculus of sets reserving the term calculus of classes 
for the new system which they set out to develop) but differs by introducing 
a notion of degree or weight, instead of the certainty, that an individual 

*This work was done while the author was an S.S.R.C. Fellow at Columbia
University working on a project in social science methodology. Much is owed to Professor
Paul Lazarsfeld for frequent discussions and suggestions both prior to and during the
writing of this paper. Herbert Menzel also contributed valuable critical comments.
Professor Abraham Kaplan also aided by reading the manuscript and offering helpful comments.
†Now at the University of California, Berkeley.
either does or does not belong to a given class (set). The combination of
qualities from which his class is to be inferred they call the individual’s 
profile. (The terms profile and weight will be given precise definitions in 
a following section.) 

The calculus of empirical classes (hereafter referred to as CEC) is felt 
to be worth exploring for social scientists for the elucidation of what are 
called constructs, intervening variables, latent classes, etc. The authors 
attempt to develop a logical calculus (formal model) of classes that ap- 
proximate the empirical classes commonly dealt with in research. As such 
it is one of the few attempts by philosophers to deal with this area. The 
calculus of probability is an integral part of the Kaplan-Schott development, 
just as it is more and more proving to be in empirical work. 

The CEC also has implications for the concept of operational defini- 
tion. The probabilistic character of the relation between a concept and the 
various measures of it is explicated by the CEC. Perhaps the term indicator 
definition will prove more useful than operational definition. This point is 
discussed at greater length in the original article by Kaplan [2]. 


Our problem can be phrased: to construct a calculus (an uninterpreted 
logical system) which will provide a more adequate explication of classes in 
their scientific use than is afforded by the conventional “calculus of classes’’ 
which we prefer to call the calculus of sets. 

The procedure of the paper is to construct ... entities which have a 
degree of vagueness characteristic of actual (empirical) classes, but which, 
when this degree is minimal, correspond to precise sets (those classes for 
which every element either is or is not a member). 


It should be noted that the attempt is to construct a calculus, i.e., a 
logical system not interpreted in terms of empirical correlates, although 
the authors have in mind a specific interpretation, namely, empirical classes. 
In this exposition, this interpretation will frequently be made so that the 
formal structure will not obscure the relevance of the concepts being of- 
fered to social science. 


Preliminary Ideas and Terminology 


K-S defined the field (of inquiry) as the set of individuals with which 
a particular inquiry is concerned. In the social sciences the individuals are 
primarily individual persons or groups. 

Articulation is the partitioning or dividing of a field into a set of cate- 
gories, each of which exhaustively divides a set of individuals into mutually 
exclusive sets according to their specific qualities in the categories in question. 
An articulation then is a set of sets (the categories) of exclusive sets (the 
qualities). For example, a population of army officer candidates (the field) 
may be articulated by use of the variables or attributes (categories), ascend- 
ance, intelligence, age, and length of service. Within each category there 
are certain values which can be assigned to each officer (thus dividing them
into mutually exclusive sets), e.g., high ascendance, low ascendance; under 
20 years of age, 21 to 25, 26 to 30, over 31 years of age. A similarity should 
be noted between the notion of an articulated field and the notion of property 
space as elaborated by Lazarsfeld and Barton [4]. 

A category must be exhaustive, i.e., it must assign one quality to
each individual, and since the qualities are mutually exclusive, the category
assigns only one quality to each individual. Categories are what are
frequently called variables, attributes, or dimensions in social science. They would
include for people: age, sex, amount of education, dominance, dependence, 
friendliness, socio-economic status, etc., and for groups: cohesion, democratic, 
productive, satisfaction, etc. 

Some categories may have few qualities and others many, where qualities 
are mutually exclusive sets of particular categories. Qualities in the social 
sciences are often called values of a variable, scale points, discriminable classes 
along a dimension, etc. Examples are: high, low; very much, some, very 
little; higher, same as, lower; agrees, disagrees. In every case they are mutually 
exclusive, i.e., no individual can have more than one quality. Refinements 
of methods of observation can be taken into account in terms of an increase 
in the number of qualities discriminated. 

To clarify the next terms introduced, an example to be used throughout 
this presentation will be constructed. The field is considered to be all army 
officer candidates at a given army post numbering 1,000, the size of the 
field. This field will be articulated in two ways, called for simplicity psycholog- 
ical and sociological (see Table 1). (In passing, it may be noted how two 
different articulations of the same field may be constructed, that each cate- 
gory applies to every individual, and that the qualities within each category 
are mutually exclusive.) 

A profile is a set of qualities, one from each category in an articulation. 
A profile describes an individual completely with respect to a context of 
inquiry. It corresponds to what in social science is often called a complete 
descriptive category, or typology, or a point in a property space. In sample 
articulations the following would be examples of profiles, or complete descrip- 
tions of individuals: Independent-Friendly-Low Participator (IFL); 21 to 
25-High School-Working Class-Father Army Man (25HWA). It should be 
noted that the profile is the name for the descriptive category—the set of 
qualities; it is not the name for the individuals who fall into the category. 
The set of individuals characterized by a profile is called the unit cell. The 
unit cell represents the limits of discrimination possible within an articulation. 
Individual members of a unit cell are indistinguishable within the articulation. 
They must be identified as distinct individuals by criteria external to the 
articulation. 

In the examples, all those officer candidates who are, e.g., Dominant- 
TABLE 1
Two Articulations of the Field "Officer"

Psychological Articulation

Categories         Qualities                                   Abbreviations

Dominance          Dominant - Independent - Submissive         D, I, S

Friendliness       Too friendly - Friendly - Superficially     T, F, S, N
                   friendly - Not friendly

Participation      High participation - Low participation      H, L

Sociological Articulation

Age                18-20, 21-25, 26-30, 31-35, 36-40           20, 25, 30, 35, 40

Education          Grade School - High School - College        G, H, C

Social Class       Lower - Working - Middle - Upper            L, W, M, U

Father's Military  Regular Army Man - Not Army Man             A, N
Status





Too Friendly-High Participators (DTH) constitute a unit cell. Since there are 
no other categories in the articulation to use for discriminating the members 
of the unit cell, they are indistinguishable. Criteria external to the articulation 
must be used to distinguish them, e.g. proper names. 

To consider a group of individuals alike in some ways but not in others 
a sub-profile is used. This is a profile of a sub-set of categories of the arti- 
culation, in other words, a set of qualities one from each of some categories 
but not all of an articulation. The classification 20-College-Upper Class 
(20CU) in the sociological articulation constitutes a sub-profile since it is 
a profile of a sub-set of the articulation, namely, age, education, and social 
class. Other examples of sub-profiles are: Submissive-Low Participator (SL); 
21 to 25-High School-Middle Class (25HM); Too Friendly (T). 

In analogy with the profile, the sub-profile is the name of the descriptive 
category. Cell is the name given to the set of individuals characterized by 
a sub-profile. The cell has the same relation to the sub-profile as the unit 
cell has to the profile. In the above example all of the officer candidates 
who are classified 20CU would constitute the cell corresponding to that 
sub-profile. The cell would consist of all the unit cells corresponding to 
profiles that had the qualities 20, college, and upper class, regardless of 
their qualities in the other categories, hence the cell 20CU consists of the 
unit cells 20CUA and 20CUN. 
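
The relation between profiles, sub-profiles, unit cells and cells can be made concrete with a short sketch. The Python below is only an illustration: the dictionary layout and the helper names `cell` and `label` are assumptions made here, and the quality abbreviations follow those used in the text (20, 25, ...; G, H, C; L, W, M, U; A, N). It enumerates every profile of the sociological articulation and recovers the unit cells that make up the cell 20CU.

```python
from itertools import product

# Sociological articulation: category -> abbreviations of its qualities.
sociological = {
    "Age": ["20", "25", "30", "35", "40"],
    "Education": ["G", "H", "C"],
    "Social Class": ["L", "W", "M", "U"],
    "Father's Military Status": ["A", "N"],
}
categories = list(sociological)

# A profile takes exactly one quality from each category.
profiles = [dict(zip(categories, combo))
            for combo in product(*sociological.values())]

def cell(sub_profile):
    """Profiles (one per unit cell) agreeing with every quality in the sub-profile."""
    return [p for p in profiles
            if all(p[cat] == q for cat, q in sub_profile.items())]

def label(profile):
    """Abbreviated name of a profile, e.g. 25HWA."""
    return "".join(profile[cat] for cat in categories)

# The sub-profile 20-College-Upper Class leaves the fourth category free.
cell_20cu = [label(p) for p in cell({"Age": "20",
                                     "Education": "C",
                                     "Social Class": "U"})]
print(len(profiles), cell_20cu)
```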
Elements of the Calculus


A basic notion in the development of the calculus of empirical classes
is that of the indicator. The indicator is defined as a function which co- 
ordinates to every profile some real number between 0 and 1, inclusive. Thus 
to every profile (or point in property-space) there corresponds only one 
real number in the unit interval (between 0 and 1, inclusive), but for a given 
number in this interval there may correspond many profiles. For example, 
the profile DFH may have assigned the number 0.61; DFL, 0.33; SFH, 
0.61; STL, 0.94; etc., with each profile having exactly one number cor- 
responding to it but a given number (e.g., 0.61) may have many profiles in 
correspondence. 

The definition given for indicator illustrates the formal nature of this 
logical development. Indicator is defined strictly with reference to logical 
terms such as function, coordinates, number, with no commitment to em- 
pirical interpretation. As will be seen, an empirical meaning will later be 
suggested. 

Every indicator determines a class, where class means the set of couples 
(each profile and the number assigned to it by the indicator) determined by 
the indicator. In the calculus of sets, a class may be defined as all of those 
individuals with a given property, say red. We may state this in CEC terms 
as the set of couples consisting of each red object and the number 1 assigned 
to it by an indicator function and each non-red object in the field and the 
corresponding number 0. The number 1 is taken to mean possession of the 
property red. The CEC concept of ‘class’ would make this a special case 
of the more general case where each object has to a certain degree the property 
red, so, for example, a scarlet object would be assigned a higher number 
by the class indicator than an orange object, and much higher than a yellow 
object.

The number an indicator assigns a profile is called the weight from 
the profile to the class specified by the indicator. To illustrate, weights 
are assigned to all possible profiles for the psychological articulation of 
the field of inquiry (see Table 2). 

There are 24 possible profiles. For an example of a class consider future 
army officers (Officer). Define Officer as ‘‘candidate who successfully com- 
pletes officer training.’ Each cell is a profile, and each cell entry the weight 
of the profile to the class Officer. The indicator is the function which assigns 
a number to each profile, i.e., Table 2. 
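
An indicator, so defined, is simply a mapping from profiles to numbers in the unit interval. The sketch below is an illustration under stated assumptions: the helper names `is_indicator` and `is_set` are introduced here, and the four weights are the ones used earlier as a verbal example (DFH, DFL, SFH, STL), not a complete indicator for the articulation.

```python
from itertools import product

# Psychological articulation: one quality from each of three categories.
dominance = ["D", "I", "S"]
friendliness = ["T", "F", "S", "N"]
participation = ["H", "L"]
profiles = ["".join(q) for q in product(dominance, friendliness, participation)]

def is_indicator(weights, profiles):
    """True if `weights` gives every profile exactly one number in [0, 1]."""
    return (set(weights) == set(profiles)
            and all(0.0 <= w <= 1.0 for w in weights.values()))

def is_set(weights):
    """True if every weight is 0 or 1, i.e. the class behaves as a set."""
    return all(w in (0, 1) for w in weights.values())

# Weights quoted in the text for four profiles only.
example = {"DFH": 0.61, "DFL": 0.33, "SFH": 0.61, "STL": 0.94}

# One number per profile, but one number may belong to several profiles.
same_weight = sorted(p for p, w in example.items() if w == 0.61)
print(len(profiles), same_weight)
```

A dictionary enforces the "exactly one number per profile" half of the definition by construction, which is why the check only needs to confirm coverage and range.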

It is important to note the relation between a category, a quality, and 
a class. The differences seem to be explicable primarily in terms of whether 
or not the class is applied to an articulated or nonarticulated field. For 
example, suppose a field is articulated by the use of the one category Partic- 
ipation (qualities: High Participator, Low Participator) and then we 
apply the class Participation to the articulation. High Participator could 
TABLE 2

Profiles and Hypothetical Weights to the Class "Officer"
for the Psychological Articulation

                                      Dominance
                    Dominant         Independent      Submissive
Friendliness        Participation    Participation    Participation
                    High    Low      High    Low      High    Low

Too Friendly        .29     .18      .43     .33      .20     .11
Friendly            .38     .31      .89     .14      .87     .14
Superficially       .47     .82      .29     .63      [illegible]  .38
  Friendly
Not Friendly        .61     .19      .22     .06      .09     .52





then be defined by giving a profile a weight of 0 if it were Low Participator 
and 1 if it were High Participator. However, if the class Popular is used, 
the weights from each profile (High Participator and Low Participator) 
would vary between 0 and 1, since participation and popularity are not 
perfectly related. 

It is obvious that the situation could be reversed by articulating the 
field with the one category Popularity (qualities: Popular, Nonpopular), 
then apply Popular as a class to this new articulation and assign weights 
of 0 or 1 to every profile. Then the indicator specifying the class High Par- 
ticipator would assign weights from each profile between 0 and 1. Thus 
classes are very similar to qualities. The selection of a category and its 
qualities represents an act of choice in which all profiles by definition have 
one and only one quality from each category. Once the field has been articu- 
lated and a class introduced the weight from each profile is determined 
by its indicator function. Hence a quality could be considered a class to 
which each profile has the weight 0 or 1. 

In social science there are similar terms. A category used to articu- 
late the field is often called an independent variable and classes to which 
the individuals are referred are dependent variables. For example, indepen- 
dent variables may be religion of respondent which has the qualities, Prot- 
estant, Catholic, Jew, and income, with the qualities, High, Low. The 
dependent variable “votes Democratic” may then be introduced. A weight 
would be assigned from each profile to the dependent variable. The terms
antecedent and consequent are sometimes used in the sense of category 
and class. Also observable and nonobservable variables, predictor and criterion 
variables, and manifest and latent variables are similar, respectively, to 
category and class. 

Hence, in the example, Officer was a criterion or dependent variable, 
but an antecedent explanatory variable like father’s ambition as the class 
could have been chosen equally well. Similarly, latent variables such as 
anxiety could be used. The two cardinal points are (i) there is no restriction
on type of variable included under class or category, (ii) the relation between
class and category (or quality) is very close, the difference resting primarily 
on the choice of whether the class is applied to an articulated or unarticulated 
field. 

An individual x is an instance to the degree p of a class, if the indicator 
specifying the class assigns to the profile of x the number p. Thus the CEC 
treatment differs from the set-theoretical explication of classes primarily 
in treating every individual as an instance to some degree or other of the 
class rather than taking some individuals to be instances without qualifi- 
cation and others not at all. Classes differ from one another in the degree 
of membership in them assigned to the various individuals in the field of 
inquiry. This fundamental notion of a class, as opposed to a set, may be 
represented diagrammatically. In Figure 1, a circle represents a set or class 





























[Figure 1: two panels, labeled EMPIRICAL CLASS and SET]


Figure 1
Diagrammatic Representation of Empirical Class and Traditional Set


and rectangles represent individuals. The shaded portion represents the 
degree of membership of an individual in a class; for sets this degree is either 
0 or 1, for classes it varies from 0 to 1. Thus the set is a special case of a class. 


Operations on Classes 


Operations on classes are defined by rules for the combinations of weights. 
Analogous to this are the rules for combining classes in the set calculus, 
such as product, sum, etc. In the set calculus definitions utilize membership 
or nonmembership in a class or combination of classes. If by contrast
individuals are regarded as having membership to some degree in every class
there must be a reformulation of class operations in terms of combinations 
of weights. 

The guiding principles in formulating the rules for the combinations 
of weights are: (i) classes for which the weights are all 0 or 1 are to have
the properties of sets; (ii) the class connectives (sum, product, etc.) in these
cases are to reduce to the familiar set connectives; (iii) the weights are to
correspond as closely as possible to probabilities. These conditions lead to 
interesting field characteristics. Their relation to cumulative or Guttman 
scales is discussed below. 

A note on terminology: K-S use p to designate articulations but here
numbers will be used in the service of simplicity. $\lambda$ is the symbol for profile,
and small letters a, b, c designate classes. The notation

$$W_1(\lambda, a) = p \tag{1}$$

states that the weight in articulation 1 from profile $\lambda$ to class a is p. The
notation K-S use (and followed here) is that of Reichenbach [5] and is read:
the weight from $\lambda$ to a is p, or, the weight of a, given $\lambda$. An example: if
$\lambda$ is the profile 25HLN and a is the class Officer, and articulation 2 is the
sociological articulation, then

$$W_2(\lambda, a) = p \tag{2}$$

may be read “the weight from the profile 25HLN to the class Officer is p”
or, using the probability interpretation of weight, “the probability that
an officer candidate of profile 25HLN will become an army officer is equal
to p.” Feller’s [1] equivalent notation is

$$W_2[a \mid \lambda] = p. \tag{3}$$


Class Product 


The definition of class product will be given first since it is presupposed
in the definitions of most of the other operations. Here, as in the conventional 
calculus, the product of two classes refers to the joint or overlapping class 
of individuals who have membership in both of the original classes simul- 
taneously, e.g., the product of the classes Officer and Popular refers to all 


those individuals who are popular officers. 
In the new calculus of empirical classes class product is defined as follows:

$$[c = a \cap b] =_{Df} [W(\lambda, c) = \min [W(\lambda, a), W(\lambda, b)]], \quad \text{for all } \lambda. \tag{4}$$

Equation (4) may be read “the weight from $\lambda$ to class c, where c is the joint
class of a and b, is equal to the lesser of the two weights (those from $\lambda$ to
a and from $\lambda$ to b).”
The product of two classes is defined as that class whose weight from 
each profile is the lesser of the weights to the original classes from that
profile. Suppose the weights from every profile to two different classes are 
known, e.g., Officers and Popular, in other words one knows the weight 
(or probability) of every possible type of candidate (profile) becoming an 
army officer, and also of being popular among the other candidates. The 
question is, how can one compute his probability of being both an army 
officer and popular. 
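
As a minimal sketch of definition (4), the class product can be computed profile by profile as a minimum. The Python below assumes, purely for illustration, that a class is stored as a dictionary from profile names to weights; the function name `class_product` and all the weights shown are hypothetical.

```python
def class_product(w_a, w_b):
    """Weights of the product class: the lesser weight at every profile."""
    return {lam: min(w_a[lam], w_b[lam]) for lam in w_a}

# Hypothetical weights from two profiles to the classes Officer and Popular.
officer = {"DFH": 0.7, "DFL": 0.4}
popular = {"DFH": 0.2, "DFL": 0.9}
popular_officer = class_product(officer, popular)

# When every weight is 0 or 1 the rule reduces to ordinary set intersection.
set_a = {"DFH": 1, "DFL": 1}
set_b = {"DFH": 0, "DFL": 1}
intersection = class_product(set_a, set_b)
print(popular_officer, intersection)
```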

In the set calculus this probability would, in general, be equal to the 
proportion of candidates who were both officers and popular, in other words, 
the proportion of candidates who had weight equal 1 for both the classes 
Officers and Popular. An individual case is immediately determinable as 
either both weights 1 or not. In CEC the weight of a candidate being both 
an officer and popular is equal to the lesser weight of the individual classes. 
If his chances of being popular are only 0.2, while his chances (interpreted 
weight) of being an officer are 0.7, his likelihood of being both an officer 
and popular is only 0.2. Here it becomes clear that weight cannot in general 
be given the meaning of empirical probability for this definition is very dif- 
ferent from the familiar rule for the probability of the occurrence of a joint 
class which is the product of the individual probabilities assuming the classes 
are independent. 

The only condition under which the above definition of the class product 
as a minimum corresponds to the familiar rules for empirical probabilities 
of a joint class is when one class is included in another. Consider the in- 
terpretation of weight as an empirical probability, i.e., the limit of relative 
frequency. If the properties Officer and Popular are independent, the
probability of an individual being both popular and an officer is 0.14 (0.7 × 0.2),
considerably less than the 0.2 called for by the CEC definition of class product. 
The CEC definition, when related to the probability interpretation, posits 
a certain degree of dependence between classes, specifically, that one includes 
the other. In other words, the only conditions under which being a popular 
officer has probability 0.2 is when every popular candidate also becomes 
an officer, i.e., when the probability of a popular candidate being an officer 
is 1.0. If this definition were extended to more classes such that the prob- 
ability of an individual being in all n classes exactly equalled its least prob- 
ability for any class, the n classes form a cumulative or Guttman scale. 
Hence, the interpretation of weight as an empirical probability is subject 
to considerable restriction. 
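
The contrast drawn here is a matter of simple arithmetic, which the following sketch spells out with the figures used above (0.7 for Officer, 0.2 for Popular). The variable names are assumptions made for illustration; the conditional probability of 1.0 encodes the case in which every popular candidate becomes an officer.

```python
import math

p_officer, p_popular = 0.7, 0.2

# Familiar product rule, valid when the two classes are independent.
joint_if_independent = p_officer * p_popular

# CEC class product: the lesser of the two weights.
joint_cec = min(p_officer, p_popular)

# General multiplication rule with P(officer | popular) = 1.0, i.e. the
# class Popular is included in the class Officer.
p_officer_given_popular = 1.0
joint_if_included = p_popular * p_officer_given_popular

print(round(joint_if_independent, 2), joint_cec, joint_if_included)
```

Only in the included case does the probability of the joint class coincide with the minimum rule, which is the restriction the text describes.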

This definition of class product was chosen for specific reasons, but 
before these can be outlined it is necessary to introduce a second operation. 


Class Inclusion 


The basic operation determining how the other operations should be 
defined is class inclusion. This operation defines the conditions under which 
Figure 2
Class Inclusion in Set Calculus (a included in b) 


one class is to be included within the other. In the regular set calculus in- 
clusion is defined, 


$$a \subset b =_{Df} a = a \cap b, \tag{5}$$


which is read “the class a is included in the class b,”’ equals by definition, 
“the class a is equal to the product (joint class) of a and b.” Figure 2 re- 
presents this diagrammatically. 

The problem then becomes one of defining inclusion in the calculus of 
empirical classes, analogous to the definition in the set calculus and such 
that it is equivalent to the set calculus in the extreme (i.e., when all weights
equal zero or one). This was solved by making the conditions for class in- 
clusion that, if a were included in b, the weight from every individual to 
a must be less than or equal to the weight from that individual to b. Figure 
3 represents this diagrammatically. In this diagram the weight (proportion) 
of each individual must be less (or equal) to a than to b since all weights to a 


are also to b. 
Hence the definition: Class Inclusion, 


(6) aC b =p; [W(A, a) = WA, a) d)], forall X. 


Since class product is defined in terms of lesser weights from each profile, 
a is included in b if and only if the weight to a from each profile is less than 


or equal to that to b. 

To follow our example, consider the classes Officer and Becomes Regular
Army Man (Regular). Officer includes Regular if for every profile it is at
least as probable that a candidate with that profile becomes an officer as it is
that he becomes a regular.

Figure 3
Class Inclusion in Empirical Class Calculus (a included in b)

In the familiar set calculus, a is included in b means, in effect, “If an
individual is a member of a he will surely be a member of b.” In the new
CEC, a is included in b is made to mean, in effect, “Insofar as an individual
is a member of a he will at least as surely be a member of b.” In more precise
language, the combined result of the definitions of class inclusion and class
product is that if a is included in b, the weight from every individual to a
must be less than or equal to the weight from that individual to b.

This, indeed, is the reason why the particular definition of class product
was adopted in the CEC. For, as a result, the set calculus notion that “Regular
is included in Officer if every man that is a Regular is also an Officer” is
true in the ideal case of the empirical class calculus where all the weights
from candidate to Officer are 1 and some from candidate to Regular are 1
and some are zero.
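
The inclusion rule can be sketched in a few lines of Python. This is only an illustration: the function names and the four pairs of weights are hypothetical and are not taken from the original article; only the rule itself (class product as the lesser weight, inclusion as equality of a with the product of a and b) follows definition (6).

```python
def class_product(wa, wb):
    """Weights to the class product: the lesser weight from each profile."""
    return [min(x, y) for x, y in zip(wa, wb)]


def is_included(wa, wb):
    """Definition (6): a is included in b when W(profile, a) equals
    W(profile, a-and-b) for every profile."""
    return class_product(wa, wb) == wa


# Hypothetical weights from four profiles (illustrative values only).
regular = [0.10, 0.25, 0.00, 0.40]
officer = [0.30, 0.25, 0.20, 0.90]

print(is_included(regular, officer))
print(is_included(officer, regular))

# The same test stated directly as "weight to a never exceeds weight to b".
print(all(x <= y for x, y in zip(regular, officer)))
```

With these made-up weights Regular is included in Officer but not conversely, and the equality form and the "less than or equal" form of the test agree.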

Class Sum


(7) $[c = a \cup b] =_{Df} [W(\lambda, c) = W(\lambda, a) + W(\lambda, b) - W(\lambda, a \cap b)]$, for all $\lambda$.


The weight from each profile to the sum of two classes is defined (as in
the probability calculus) as the sum of the weights diminished by the weight
of the class product. By the definition of class product, this is equivalent
to taking the greater of the original weights

(8) $[W(\lambda, a) \text{ or } W(\lambda, b)]$.


In the set calculus the class sum is simply all the individuals who are 
members of (have weight 1 to) either class. This is easily computed by adding 
the individuals who belong to each class and subtracting those who are 
members of both classes (and therefore who have been counted twice). By 
analogy, the empirical class calculus would say that the weight of an indi- 
vidual being either an Officer or Popular is the sum of the weight for being 
an Officer and for being Popular minus the weight of being both. But since 
being both (the class product) has a weight equal to the lesser of the two 
original weights, the class sum becomes: greater weight plus lesser weight 
minus lesser weight. Thus the weight for being either an officer or popular 
is equal to whichever weight is higher. The class sum is then defined as
that class whose weight from each profile is the greater of the weights from
that profile to the original classes.
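
The reduction of definition (7) to "the greater weight" can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, and the pairing of 0.7 with Officer and 0.2 with Popular is an assumption made for the example.

```python
import math


def class_sum(wa, wb):
    """Definition (7) for one profile: W(a) + W(b) - W(a-and-b),
    with the class product taken as the lesser of the two weights."""
    return wa + wb - min(wa, wb)


officer, popular = 0.7, 0.2  # hypothetical weights from a single profile
s = class_sum(officer, popular)
print(round(s, 10))

# On a grid of weights the result always coincides with the greater weight.
grid = [i / 20 for i in range(21)]
same_as_max = all(
    math.isclose(class_sum(x, y), max(x, y), abs_tol=1e-12)
    for x in grid
    for y in grid
)
print(same_as_max)
```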


Class Contrary 


(9) $[b = -a] =_{Df} [W(\lambda, b) + W(\lambda, a) = 1]$, for all $\lambda$.


The contrary of a class is defined as the class to which the weight from
each profile differs from unity by the weight to the original class. The
interesting property of this definition is that it does not satisfy the law of
the excluded middle (i.e., the law which states that every individual belongs
either to a class or to its contrary), since the weight from each profile
to the class sum of a class and its contrary is only the larger of the two
weights, not their sum (which is one). As the weights to a class approach
the extreme values of 0 and 1, the weights to the sum of the class and its
contrary approach 1, until in the limiting case the law of the excluded middle
is satisfied.
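
A small numerical sketch may make the limiting behavior concrete. The function names and the sample weights are hypothetical; the rules are the ones stated above (contrary as one minus the weight, class sum as the greater weight).

```python
def contrary(w):
    """Class contrary: the weight that differs from unity by w."""
    return 1.0 - w


def class_sum(wa, wb):
    """Class sum for one profile: the greater of the two weights."""
    return max(wa, wb)


# Weight to "a or not-a" for weights moving toward the extreme value 1.
for w in (0.5, 0.8, 0.99, 1.0):
    print(w, class_sum(w, contrary(w)))
```

The weight to the sum of a class and its contrary is never below 0.5 and reaches 1 only when the original weight is 0 or 1.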


Class Difference 


In the set calculus the difference between two sets is definable in terms
of the set complement, i.e.,


(10) $(\alpha - \beta) =_{Df} (\alpha \cap -\beta)$


(α minus β equals by definition α and not β). However, the corresponding
formula in terms of the contrary of a class, a ∩ −b, is not interpretable
in general as a but not b, since its weight from each profile is either that of
a or that of the contrary of b, whichever is smaller. Hence class difference is
defined as that class whose weight from any profile is equal to the difference in the
weights from that profile to the original classes, respectively, except that
it is 0 if that difference is not positive.


(11) $[c = a - b] =_{Df} \{W(\lambda, c) = \max [W(\lambda, a) - W(\lambda, b), 0]\}$, for all $\lambda$.


In the limiting case where the class b has weights of only 0 or 1, the class
contrary is directly analogous to the set complement, and, as in the set
calculus, a − b = a ∩ −b.
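
The contrast between definition (11) and the product with the contrary can be sketched as follows. The function names are hypothetical, and the pair of weights 0.61 and 0.03 is used only as an example.

```python
def class_difference(wa, wb):
    """Definition (11): the difference of the weights, or 0 if it is not positive."""
    return max(wa - wb, 0.0)


def product_with_contrary(wa, wb):
    """Weight to 'a and the contrary of b': the lesser of W(a) and 1 - W(b)."""
    return min(wa, 1.0 - wb)


# In general the two expressions differ.
print(class_difference(0.61, 0.03), product_with_contrary(0.61, 0.03))

# When b has only the weights 0 or 1 they coincide, as in the set calculus.
grid = [i / 10 for i in range(11)]
agree_in_limit = all(
    class_difference(wa, wb) == product_with_contrary(wa, wb)
    for wa in grid
    for wb in (0.0, 1.0)
)
print(agree_in_limit)
```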

To illustrate the above operations for the psychological articulation,
hypothetical values of the weights from each profile to the class sum, class
product, class contrary, and class difference will be computed. Only five
profiles will be used, since these suffice for this example (see Table 3).


TABLE 3

Hypothetical Values for Profile Weights to Various Class Operations

                                   Product         Sum             Contrary     Difference
Profile   W(λi, a)    W(λi, b)     W(λi, a ∩ b)    W(λi, a ∪ b)    W(λi, −a)    W(λi, a − b)

λ1        0.89        0.81         0.81            0.89            0.11         0.08
λ2        0.61        0.03         0.03            0.61            0.39         0.58
λ3        0.53        0.72         0.53            0.72            0.47         0.00
λ4        0.32        0.18         0.18            0.32            0.68         0.14
λ5        0.12        0.12         0.12            0.12            0.88         0.00





An example of class inclusion is provided by the product and sum where 
the product is included in the sum, since for every profile the weight to 
the product is less than (or equal to) the weight to the sum. 
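
The derived columns of Table 3 can be recomputed from the two columns of original weights. The sketch below assumes nothing beyond the definitions of the four operations; the dictionary layout and function name are illustrative choices.

```python
# Weights W(λi, a) and W(λi, b) as printed in Table 3.
weights = {
    "λ1": (0.89, 0.81),
    "λ2": (0.61, 0.03),
    "λ3": (0.53, 0.72),
    "λ4": (0.32, 0.18),
    "λ5": (0.12, 0.12),
}


def operations(wa, wb):
    """Return (product, sum, contrary of a, difference a - b), rounded to two places."""
    return (
        round(min(wa, wb), 2),
        round(max(wa, wb), 2),
        round(1 - wa, 2),
        round(max(wa - wb, 0.0), 2),
    )


table = {name: operations(wa, wb) for name, (wa, wb) in weights.items()}
for name, row in table.items():
    print(name, row)

# Inclusion of the product in the sum: the product weight never exceeds the sum weight.
product_in_sum = all(row[0] <= row[1] for row in table.values())
print(product_in_sum)
```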

Class product and class difference explicate the application of this 
calculus to combinations of responses on a questionnaire. For example, assume 
membership in an empirical class is interpreted as giving a positive response 
to an item, and weight from a profile to a class is interpreted as the nominal 
probability (to distinguish it from the relative frequency probability) of an 
individual or type (the profile) giving a positive response to a given item 
(the class). Then the nominal probability of a positive response on two 
items (++) is the class product of the two weights, while the nominal 
probability of a positive response on the first item and a negative response 
on the second item (+−) is equal to the class difference. Thus from the
product and the difference the nominal probabilities of all possible response 
patterns could be computed. Again the isomorphism with the cumulative 
scale should be noted. 
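
A sketch of the four response-pattern probabilities for two items follows. The ++ and +− rules are the ones stated in the text; treating −+ as the difference taken in the other direction and −− as the product of the two contraries is an assumption of this example, as are the function name and the sample weights.

```python
def response_pattern_weights(wa, wb):
    """Nominal probabilities of the four response patterns on items a and b."""
    return {
        "++": min(wa, wb),                 # class product
        "+-": max(wa - wb, 0.0),           # class difference a - b
        "-+": max(wb - wa, 0.0),           # class difference b - a (assumed by symmetry)
        "--": min(1.0 - wa, 1.0 - wb),     # product of the contraries (assumed)
    }


patterns = response_pattern_weights(0.9, 0.2)   # hypothetical weights
for pattern, weight in patterns.items():
    print(pattern, round(weight, 2))

print(round(sum(patterns.values()), 10))
```

Under these assumptions the four weights sum to one, and at most one of the patterns +− and −+ has a nonzero weight, which is the cumulative-scale property noted above.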

This completes the fundamental definitions which allow for correspondence
(in the extreme) of this calculus for empirical classes to the usual
set calculus. The purpose of the calculus so far is to transform the dichotomy 
of membership-nonmembership, basic to the set calculus, into a system with 
the fundamental properties and operations of the set calculus but incorpora- 
ting the notion of weights or probabilities to the relation between an indi- 
vidual and a class. This makes the set calculus a special case of the class 
calculus where the probability of any individual belonging to any class is 
either 0 or 1. 

Definitions of the four operations allow the development of a calculus of 
classes which corresponds in many respects to the usual calculus of sets. The 
following set calculus laws hold in this class calculus (with appropriate sub- 
stitution of symbols): the laws of commutativity and associativity for sum 
and product, DeMorgan’s theorems, the laws of tautology, the two distri- 
bution laws, the laws of double negation and of transposition and the transi- 
tivity of inclusion. 
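
Several of these laws can be spot-checked numerically with product as the lesser weight, sum as the greater weight, and contrary as one minus the weight. The sketch below is a check on a finite grid of exact rational weights, not a proof; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product as triples

# Exact rational weights 0, 1/10, ..., 1 avoid floating-point noise.
GRID = [Fraction(i, 10) for i in range(11)]


def neg(w):
    """Class contrary: the weight differs from unity by the original weight."""
    return 1 - w


def laws_hold(a, b, c):
    """Check the listed laws for one triple of weights from a single profile."""
    return all([
        min(a, b) == min(b, a) and max(a, b) == max(b, a),   # commutativity
        min(a, min(b, c)) == min(min(a, b), c),              # associativity (product)
        max(a, max(b, c)) == max(max(a, b), c),              # associativity (sum)
        neg(min(a, b)) == max(neg(a), neg(b)),               # De Morgan
        neg(max(a, b)) == min(neg(a), neg(b)),               # De Morgan
        min(a, a) == a and max(a, a) == a,                   # tautology
        min(a, max(b, c)) == max(min(a, b), min(a, c)),      # distribution
        max(a, min(b, c)) == min(max(a, b), max(a, c)),      # distribution
        neg(neg(a)) == a,                                    # double negation
        (a <= b) == (neg(b) <= neg(a)),                      # transposition
        not (a <= b and b <= c) or a <= c,                   # transitivity of inclusion
    ])


all_hold = all(laws_hold(a, b, c) for a, b, c in triples(GRID, repeat=3))
print(all_hold)
```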

One more concept is needed to complete the analogue to the set calculus.
Define membership to the degree p, to be symbolized by $\epsilon^{p}$, as the relation
of an individual to a class when the weight from the profile characterizing
the individual to the class is p. Class analogues can then be given for many
propositions; for example, the principle of extensionality becomes


(12) $[a = b] \equiv [x \, \epsilon^{p} \, a \equiv x \, \epsilon^{p} \, b]$, for all $x$.


In other words, for two classes to be equivalent, all individuals must 
have membership to the same degree in each class. In our example Army 
Officers and Graduated Candidate School would in most cases be equivalent 
classes. 

K-S then proceed to develop this calculus to cover the topics of nominal 
and empirical probabilities, the reference function (the reverse of an indicator, 
namely the nominal probability that if a certain class exists, a certain profile 
will also exist), classes and correlations, and constructs and intervening 
variables. Space does not permit a detailed exposition of these topics but 
the reader is referred to the original article for a very stimulating develop- 
ment. 


An Application of the Calculus of Empirical Classes 
to Latent Structure Analysis 


It appears that latent structure analysis (LSA) pursues the same objec- 
tives as CEC, although the problem is approached quite differently. A com- 
parison of one of the results of these two approaches may prove profitable. 
Since both CEC and LSA are rather elaborate theoretical structures, the 
possible points of comparison are very numerous. One aspect will be selected
for exploration: the implications of CEC for the assumption of local
independence in LSA. This is selected because (i) the local independence
assumption is the fundamental one in LSA, (ii) this point will illustrate the
value of comparing the approaches, and (iii) the result of this particular
analysis may be of substantive interest.


CEC and Local Independence 


Using one interpretation of CEC, an alternative assumption to local 
independence is suggested. This assumption, called the assumption of a 
local scale, has the advantage that it allows the items and latent classes of 
LSA to be interpreted as classes and profiles, respectively, in CEC and 
therefore falls heir to all of the logical power of the set calculus. The implica- 
tions of this alternative assumption will be explored. 


A Latent Structure Interpretation of the Calculus of Empirical Classes 


Of the possible interpretations that could be made of the CEC, pro- 
files will be interpreted as latent classes, and classes will be interpreted as 
manifest items or observations. Of course, other interpretations of this 
strictly logical calculus are possible, e.g., a reversal of the above. This one, 
however, seems to have some interesting implications. The interpretation 
construes individuals as basically divisible into latent classes (profiles) which 
are usually not directly measured but only approximated by inference made 
from observable or manifest data (CEC classes). 

An interesting consequence of this interpretation is that latent proba- 
bilities become isomorphic with weights (assigned by the indicator function). 
This is convenient because the latent probabilities give the probabilities of 
a member of a given latent class accepting a given indicator. This is the same 
meaning the weights have in the CEC if profiles represent latent classes 
and classes represent items. Similarly, the recruitment probabilities of LSA 
correspond to the reference probabilities of CEC—both being the proba- 
bility of a certain latent class (or profile) given the item response (or class). 
An example should make this clear. 

For this illustration we shall construct a very complex example. As
indicated in Table 4, assume eight latent classes (profiles) taken from our
psychological articulation, and three dichotomous items.
The cell entries present the probability that an individual with the given
profile answers items a, b, and c plus.

These same data are represented in standard LSA form in Table 5. The
entries in CEC terms are W(λi, a), where λi takes on the values of all profiles,
or latent classes. R is the reference function.

To compute the recruitment (LSA) or reference (CEC) probabilities,
divide the weight by the total probability of a given item. These values are
R(a, λi) and are given in Table 5.


Class Product and Class Sum 


Using this interpretation one divergence between LSA and CEC appears
with regard to the assumption regarding combination of probabilities. In
the class product, for example, which becomes isomorphic with LSA response
patterns, consider the case of the response pattern ++ for the latent class
OHF. LSA would say the probability equals the product of the individual
probabilities, or .8(.3) = .24, while CEC would say that the probability equals
the lesser of the two individual probabilities, or .3. The probability of response
pattern +++ is .100 for CEC and .024 for LSA. In virtually every case
the predictions would differ. Similarly, for class sum, where LSA would say
it is .8 + .3 − .8(.3) = .86 (for ++), CEC equates the probability of having
+ on either item a or b to the larger, namely, .8. For +++, LSA gives
.8 + .3 + .1 − .24 − .08 − .03 + .024 = .874, while CEC gives simply .8,
the largest value.

For class difference, a and not b (response pattern +−), there is also a
discrepancy. CEC says that the probability of answering + to a and − to b
is equal to either the difference between W(λ, a) and W(λ, b) or zero,
whichever is larger, while LSA treats this like any other class product and would
simply take the product of W(λ, a) and 1 − W(λ, b). Thus, for example,
for latent class CLF, the probability of response pattern +− for CEC
equals .9 − .2 = .7, while LSA puts this probability of +− at .9(.8) = .72.
For CHN, +− would give by CEC max [.5 − .6, 0] = 0. Therefore, the probability
of +− is zero, while for LSA it equals .5(.4) = .20.
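
The arithmetic of the last three paragraphs can be reproduced with a short sketch. The function names are hypothetical; the weights for the latent classes OHF, CLF, and CHN are the ones quoted in the text.

```python
from math import prod


def lsa_all_plus(ps):
    """Local independence: probability of + on every item is the product."""
    return prod(ps)


def cec_all_plus(ps):
    """CEC class product: the least of the weights."""
    return min(ps)


def lsa_any_plus(ps):
    """Local independence: probability of + on at least one item."""
    return 1 - prod(1 - p for p in ps)


def cec_any_plus(ps):
    """CEC class sum: the greatest of the weights."""
    return max(ps)


def lsa_plus_minus(pa, pb):
    """Local independence: + on the first item and - on the second."""
    return pa * (1 - pb)


def cec_plus_minus(pa, pb):
    """CEC class difference: the difference of the weights, or zero."""
    return max(pa - pb, 0.0)


OHF = (0.8, 0.3, 0.1)
CLF = (0.9, 0.2)
CHN = (0.5, 0.6)

print("++  ", round(lsa_all_plus(OHF[:2]), 3), cec_all_plus(OHF[:2]))
print("+++ ", round(lsa_all_plus(OHF), 3), cec_all_plus(OHF))
print("a or b     ", round(lsa_any_plus(OHF[:2]), 3), cec_any_plus(OHF[:2]))
print("a or b or c", round(lsa_any_plus(OHF), 3), cec_any_plus(OHF))
print("+- CLF", round(lsa_plus_minus(*CLF), 3), round(cec_plus_minus(*CLF), 3))
print("+- CHN", round(lsa_plus_minus(*CHN), 3), round(cec_plus_minus(*CHN), 3))
```

Each printed line gives the LSA value first and the CEC value second.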

With the notions of class product and class difference it is possible to 
compute all values from the CEC scheme of the probabilities of each response 
pattern for each profile. In Table 6 we do this for the first two items and 
contrast these values with those obtained by LSA. 


The Local Scale Assumption 


From the above it is clear that the CEC derived model differs from the
LSA model in that it does not assume local independence. It will be recalled
that the assumption of local independence of LSA states that for every latent
class λ and every pair of items i and j (all probabilities being taken within
the latent class),


(13) $P_{ij} = P_{i} P_{j}$.

The assumption being made in the CEC derived model is that

(14) $P_{ij} = \min [P_{i}, P_{j}]$,

(15) $P_{i\bar{j}} = \max [P_{i} - P_{j}, 0]$.

That is, for any two items i and j, for members of a given latent class, the
probability of accepting both items (++) equals the lesser of the two
probabilities (rather than the product of the probabilities as in LSA), and the
probability of accepting item i and not accepting item j (+−) equals the
difference in the two probabilities, or zero, whichever is larger. These two
equations characterize the local scale assumption.


The reason for this name stems from the fact that this assumption is
true when the items form a cumulative scale for the members of a latent
class. That is, the circumstance under which the local scale assumption holds
is that the members of a latent class are all perfect scale types (within usual
error) on the set of items from which the latent classes are derived. For
example, consider a five-point cumulative scale and assume the proportion
of respondents in each scale type (i.e., having the same response pattern)
is equal.
In Table 7 are the weights for each scale type for each item in a perfect
scale.


TABLE 7
Weights for Each Scale Type for Each Item in a Perfect Scale

                       Items
Scale Types      A      B      C      D      E
+ + + + +        1      1      1      1      1
+ + + + −        1      1      1      1      0
+ + + − −        1      1      1      0      0
+ + − − −        1      1      0      0      0
+ − − − −        1      0      0      0      0
− − − − −        0      0      0      0      0
Mean            .83    .67    .50    .33    .17


To test the first assumption of a local scale,

(16) $P_{ij} = \min [P_{i}, P_{j}]$,


take any two items, say B and D. The local scale assumption would say
the probability of this class accepting both B and D equals min [.67, .33],
which is .33. Computing this probability for the above latent class,


(17) (1 + 1 + 0 + 0 + 0 + 0)/6 = 2/6 = .33.

This result will be the same for any two items. To test the second assumption,

(18) $P_{i\bar{j}} = \max [P_{i} - P_{j}, 0]$,


let us take any two items, say A and E. The probability of accepting A and
rejecting E, according to the local scale assumption, equals max [.83 − .17, 0],
which is .66. The computation of this probability for the above latent class is


(19) (0 + 1 + 1 + 1 + 1 + 0)/6 = 4/6 = .67,

which agrees with .66 to within the rounding of the item means.
The zero term in the equation covers the case of nonscale types; that is, for
+ on C and − on B, a −+ response, the difference is negative, and the
probability is therefore zero. This is the formal statement that there are no
nonscale types, or that the class includes only those individuals who are scale
types. Hence the scale-type latent class fulfills the requirements of the local
scale assumption. It is likely that this is the only type of model that does,
though that proof has not yet been worked out.
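
The two checks above can be extended to every ordered pair of items in Table 7. The following sketch uses exact fractions so that no rounding enters; the helper names are hypothetical, and the six equally frequent scale types are those of the table.

```python
from fractions import Fraction
from itertools import permutations

ITEMS = "ABCDE"
# Six perfect scale types, from +++++ down to -----, each equally frequent.
SCALE_TYPES = [[1] * k + [0] * (len(ITEMS) - k) for k in range(len(ITEMS), -1, -1)]
N = len(SCALE_TYPES)


def column(item):
    """Responses (1 = accept, 0 = reject) of the six scale types to one item."""
    idx = ITEMS.index(item)
    return [row[idx] for row in SCALE_TYPES]


def mean(item):
    """Proportion of the latent class accepting the item."""
    return Fraction(sum(column(item)), N)


def both(i, j):
    """Proportion accepting both items i and j."""
    return Fraction(sum(x * y for x, y in zip(column(i), column(j))), N)


def first_only(i, j):
    """Proportion accepting i and rejecting j."""
    return Fraction(sum(x * (1 - y) for x, y in zip(column(i), column(j))), N)


local_scale_holds = all(
    both(i, j) == min(mean(i), mean(j))
    and first_only(i, j) == max(mean(i) - mean(j), 0)
    for i, j in permutations(ITEMS, 2)
)
print(local_scale_holds)
print(both("B", "D"), first_only("A", "E"))
```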

This assumption, which leads to a set of accounting equations, will 
not be pursued. The suggestion has been made by Lazarsfeld that this develop- 
ment could be pursued by studying the distribution of error. All that can 
be said at this time is that this appears to be a promising lead. The main
point to be stressed here, however, is that LSA is based on the local
independence assumption, but the local scale assumption, and in general the
consideration of other assumptions, might render LSA more flexible and lead
to different latent classes that may be more appropriate in some investigations.


NOTE ON $\phi/\phi_{\max}$
Edward E. Cureton
UNIVERSITY OF TENNESSEE


Formulas are given for a descriptive statistic related to the fourfold-point
correlation but having always-attainable limits ±1.


The fourfold-point correlation coefficient, $\phi$, is derived from a table such
as


                              X
                   −                       +
Y     +     $p_1 - p_{12}$             $p_{12}$             $p_1$
      −     $q_2 - p_1 + p_{12}$       $p_2 - p_{12}$       $q_1$
                 $q_2$                  $p_2$                1


where $p_1$, $q_1$, $p_2$, and $q_2$ are the marginal proportions, and $p_{12}$ is the joint
positive proportion. Then


(1) $\phi = \dfrac{p_{12} - p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}}$


is the product-moment correlation between the dichotomous variates X and
Y.

It is well known that $\phi$ can equal +1 only if $p_1 = p_2$, and that it
can equal −1 only if $p_1 = q_2$ ([1], p. 324; [2], p. 342). Hence $\phi$ has
limits ±1 only if all marginal proportions are .5.

In order to obtain a descriptive statistic having always-attainable limits
±1, it has been proposed that one use $\phi/\phi_{\max}$, $\phi_{\max}$ being the maximum
value of $\phi$ consistent with the given marginal proportions. It is apparent
from the table that the maximum value of $p_{12}$ is $p_1$ or $p_2$, whichever is smaller
(call it $p'$), and since the denominators of $\phi$ and $\phi_{\max}$ contain only the marginal
proportions and must therefore be identical,

(2) $\dfrac{\phi}{\phi_{\max}} = \dfrac{p_{12} - p_1 p_2}{p' - p_1 p_2}$.
This formula applies if $\phi$ is positive. But for $\phi$ negative, $\phi_{\max}$ should be the
absolute value of the largest negative correlation consistent with the given
marginals; this will occur when either $p_{12}$ or $q_2 - p_1 + p_{12}$ is 0, and the
denominator of $\phi/\phi_{\max}$ is then $p_1 p_2$ minus the smallest possible value of $p_{12}$.
If $p_{12}$ can take the value 0, the denominator is simply $p_1 p_2$. If $p_{12}$ cannot
take the value 0, $q_2 - p_1 + p_{12}$ can, in which case the smallest possible value
of $p_{12}$ is $p_1 - q_2$. For $\phi$ negative, therefore,

(3) $\dfrac{\phi}{\phi_{\max}} = \dfrac{p_{12} - p_1 p_2}{p_1 p_2 - (p_1 - q_2)}$,

and the second term in the denominator is subtracted from $p_1 p_2$ only if
$p_1 > q_2$.
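
Formulas (1), (2), and (3) can be sketched in Python as follows. The function names and the example proportions are hypothetical, and degenerate tables with a marginal proportion of 0 or 1 (which would give a zero denominator) are not handled.

```python
import math


def phi(p1, p2, p12):
    """Formula (1): fourfold-point correlation from proportions."""
    q1, q2 = 1 - p1, 1 - p2
    return (p12 - p1 * p2) / math.sqrt(p1 * q1 * p2 * q2)


def phi_over_phimax(p1, p2, p12):
    """Formula (2) when phi is not negative, formula (3) when it is negative."""
    numerator = p12 - p1 * p2
    if numerator >= 0:
        return numerator / (min(p1, p2) - p1 * p2)
    q2 = 1 - p2
    smallest_p12 = max(p1 - q2, 0.0)   # subtracted only if p1 > q2
    return numerator / (p1 * p2 - smallest_p12)


# Unequal marginals: phi stays below 1 although p12 is as large as possible.
print(round(phi(0.6, 0.3, 0.3), 4), phi_over_phimax(0.6, 0.3, 0.3))

# Smallest possible joint proportion for the same marginals.
print(round(phi(0.6, 0.3, 0.0), 4), phi_over_phimax(0.6, 0.3, 0.0))

# A case where p12 cannot reach 0 because p1 > q2.
print(round(phi_over_phimax(0.8, 0.7, 0.5), 6))
```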
More convenient computing formulas are obtained if one multiplies
numerator and denominator of (2) and (3) by $N^2$:

(4) $\dfrac{\phi}{\phi_{\max}} = \dfrac{N S_{12} - S_1 S_2}{N S' - S_1 S_2}$, $\phi$ positive;

(5) $\dfrac{\phi}{\phi_{\max}} = \dfrac{N S_{12} - S_1 S_2}{S_1 S_2 - N(S_1 + S_2 - N)}$, $\phi$ negative.


Here $S_1$ and $S_2$ are the positive marginal frequencies, $S_{12}$ the joint positive
frequency, $S' = S_1$ or $S_2$, whichever is smaller, and the second term of the
denominator of (5) is subtracted from $S_1 S_2$ only if $S_1 + S_2 > N$.

For the alternative table,*


                    X
               −           +
Y     +        a           b         $S_1$
      −        c           d         $N - S_1$
           $N - S_2$     $S_2$        $N$


the numerators of (4) and (5) are equal to $bc - ad$. For $\phi$ positive, either a
or d must be 0 when $\phi = \phi_{\max}$, so

(6) $\dfrac{\phi}{\phi_{\max}} = \dfrac{bc - ad}{b'c'}$, $\phi$ positive,

and b′ and c′ are obtained from a fourfold having the same marginals but
with a or d (whichever is smaller) replaced by 0. For $\phi$ negative, either b or c
must be 0 when $\phi$ takes its maximum negative value, so

(7) $\dfrac{\phi}{\phi_{\max}} = \dfrac{bc - ad}{a'd'}$, $\phi$ negative,

and a′ and d′ are obtained from a fourfold having the same marginals but
with b or c (whichever is smaller) replaced by 0.
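
The cell formulas (6) and (7) can be sketched and compared with the frequency formulas (4) and (5). The function names and the two example tables are hypothetical; the cell layout assumed is the one used above, with b the joint positive frequency, so that $S_1 = a + b$ and $S_2 = b + d$.

```python
def phi_over_phimax_cells(a, b, c, d):
    """Formulas (6) and (7): shift the smaller diagonal cell to 0, keeping the marginals."""
    numerator = b * c - a * d
    if numerator >= 0:
        m = min(a, d)                       # a or d is replaced by 0
        return numerator / ((b + m) * (c + m))   # b'c'
    m = min(b, c)                           # b or c is replaced by 0
    return numerator / ((a + m) * (d + m))       # a'd'


def phi_over_phimax_freq(n, s1, s2, s12):
    """Formulas (4) and (5) from the marginal and joint positive frequencies."""
    numerator = n * s12 - s1 * s2
    if numerator >= 0:
        return numerator / (n * min(s1, s2) - s1 * s2)
    excess = max(s1 + s2 - n, 0)            # subtracted only if S1 + S2 > N
    return numerator / (s1 * s2 - n * excess)


for a, b, c, d in [(10, 30, 40, 20), (30, 10, 5, 55)]:
    n, s1, s2 = a + b + c + d, a + b, b + d
    print(phi_over_phimax_cells(a, b, c, d), phi_over_phimax_freq(n, s1, s2, b))
```

For both example tables the two routes give the same ratio, as the algebra requires.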
The limitation of range of the $\phi$ coefficient is a special case. The
product-moment correlation always has attainable limits ±1 if both variates are
continuous, or more generally if no two individuals have identical scores on
either variate. With grouped or discrete distributions, its upper limit is +1
only if the two marginal distributions are identical, its lower limit is −1
only if one marginal distribution is identical with the other reversed, and its
two limits are ±1 only if both marginal distributions are symmetrical and
identical.


*This alternative development is based on an editorial suggestion.


BOOK REVIEW


Clyde H. Coombs and Richard C. Kao. Nonmetric Factor Analysis. Ann Arbor: University
of Michigan Engineering Research Institute Bulletin No. 38, 1955. Pp. vii + 63.


Following a brief description of Coombs’ theory of data, this slender technical mono- 
graph treats two classes of models for the resolution of behavior into components. These 
models, conjunctive-disjunctive and compensatory, are intended for scaling dichotomous 
monotone items collected by the method of single stimuli. The models are not competitive 
with Guttman scalogram analysis but are offered as multidimensional formulations which 
might be appropriate when unidimensional solutions are not obtained. A general axiomatic 
basis for the approach is presented, and theorems specific to each model are derived sepa- 
rately. Some very helpful interpretive material is interspersed throughout the mathe- 
matical development. 

In the conjunctive model success on a task requires a certain minimum on each of 
the relevant dimensions; in the disjunctive model success requires a minimum on any 
one of the dimensions involved. The conjunctive and disjunctive models, although psycho- 
logically distinct, are shown to be formally equivalent and are treated simultaneously 
throughout the development. In the compensatory model, which underlies multiple-re- 
gression theory and multiple-factor analysis, a given performance may be achieved by 
many different combinations of components: an excess in one attribute may compensate 
for a shortage in another. The logic of the models, then, would call for the use of multiple 
cutting scores for a conjunctive domain and multiple regression for a compensatory domain. 
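
The distinction among the three models can be illustrated with a small sketch. Everything in it is hypothetical (the function names, the two-dimensional ability scores, the minimums, the weights, and the cutoff); it is meant only to show the three decision rules described above, not the authors' procedures.

```python
def conjunctive(abilities, minimums):
    """Success requires the minimum on every relevant dimension."""
    return all(x >= m for x, m in zip(abilities, minimums))


def disjunctive(abilities, minimums):
    """Success requires the minimum on at least one dimension."""
    return any(x >= m for x, m in zip(abilities, minimums))


def compensatory(abilities, weights, cutoff):
    """Success requires a weighted sum at or above a cutoff, so an excess
    on one dimension can offset a shortage on another."""
    return sum(w * x for w, x in zip(weights, abilities)) >= cutoff


person = (0.9, 0.2)   # hypothetical scores on two dimensions
print(conjunctive(person, (0.5, 0.5)))
print(disjunctive(person, (0.5, 0.5)))
print(compensatory(person, (0.5, 0.5), 0.5))
```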

The general model is constructed by representing both individuals and stimuli as 
vectors in a multidimensional space, called the genotypic space. An individual’s response 
to a stimulus is treated formally as an order relation on the corresponding individual vector 
and stimulus vector. Theorems are derived to generate a calculus for recovering the under- 
lying genotypic structure from the data. In the conjunctive model the measures in the 
genotypic space are assumed to be elements of only an ordinal scale, and the analysis 
recovers the several components at the level of partial orders or simple orders. A method 
is also described for estimating a lower bound of the dimensionality. 

In the compensatory model the genotypic structure is strengthened to the level of 
real numbers, but the additional assumptions required to recover the elements at that 
level are not made, so that factor loadings are obtained on an ordinal scale. Both an indi- 
vidual compensatory and a stimulus compensatory model are constructed, depending on 
whether the linear function which weights the several components is characteristic of an 
individual and independent of the stimuli, or vice versa. Only the rudiments of the com- 
pensatory model have been formulated. Procedures have been developed only for the special 
case of two dimensions, and the information recovered in the analysis is the order of the 
factor loadings (cosine of the angle between vector and coordinate axis), which does not 
correspond to the order of the projections. 

These scaling models face several problems which are anticipated by Coombs and 
Kao in their discussion section. The problem of error is not considered in these completely 
deterministic models, and no procedures are offered for its treatment. The authors are 
aware, however, that “. . . the data will contain error, and some sort of stochastic
generalizations will have to be constructed to make these models useful. . . .” Another problem
arises with respect to uniqueness in handling incomplete data; Milholland has found that 
alternative solutions can be obtained in the conjunctive-disjunctive model at least. Another 
important concern involves criteria for choosing between the models for application to 
data. The authors suggest that an appropriate model would require no more dimensions 
than an inappropriate one and would generate less nonexistent data. 

Several advantages may accrue from a mathematically precise formulation of a
measurement model, not the least of which is the clarity gained from an explicit statement
of assumptions and inferences. For example, the precise formalization of a model might 
stimulate theory construction in measurement by providing the foundation for further 
generalizations and by making it possible to compare the exact properties of different 
methods. A more important contribution is sometimes made, however, in that the models 
underlying various measurement methods frequently have implications for theorizing in 
psychology. Some such implications, a few of which are briefly illustrated in the Intro- 
duction, are attendant upon the distinctions made by Coombs and Kao in formulating 
the models of Nonmetric Factor Analysis. These implications may be derived in spite of 
the fact that statistical machinery has not yet been developed for practical applications 
of the models. 

The elegance and precision gained through the axiomatic presentation have an
accompanying difficulty of reading, so that the audience for this technical monograph seems
limited to measurement specialists interested in the theory of models.